Search Results: "bernat"

1 October 2021

Vincent Bernat: FRnOG #34: how we deployed a datacenter in one click

Here are the slides I presented for FRnOG #34 in October 2021. They are about automating the deployment of Blade's datacenters using Jerikan and Ansible. For more information, have a look at Jerikan+Ansible: a configuration management system for network.
The presentation, in French, was recorded. I have added English subtitles.1

  1. A good thing if you don't understand French, as my diction was poor, with a lot of fillers.

12 September 2021

Vincent Bernat: Short feedback on Cisco pyATS and Genie Parser

Cisco pyATS is a framework for network automation and testing. It includes, among other things, an open-source multi-vendor set of parsers and models, Genie Parser. It features 2700 parsers for various commands across many network OSes. On paper, this seems like a great tool!
>>> from genie.conf.base import Device
>>> device = Device("router", os="iosxr")
>>> # Hack to parse outputs without connecting to a device
>>> device.custom.setdefault("abstraction", {})["order"] = ["os", "platform"]
>>> cmd = "show route ipv4 unicast"
>>> output = """
... Tue Oct 29 21:29:10.924 UTC
...
... O    10.13.110.0/24 [110/2] via 10.12.110.1, 5d23h, GigabitEthernet0/0/0/0.110
... """
>>> device.parse(cmd, output=output)
{'vrf': {'default': {'address_family': {'ipv4': {'routes': {'10.13.110.0/24': {'route': '10.13.110.0/24',
       'active': True,
       'route_preference': 110,
       'metric': 2,
       'source_protocol': 'ospf',
       'source_protocol_codes': 'O',
       'next_hop': {'next_hop_list': {1: {'index': 1,
          'next_hop': '10.12.110.1',
          'outgoing_interface': 'GigabitEthernet0/0/0/0.110',
          'updated': '5d23h'}}}}}}}}}
First disappointment: pyATS is closed-source, with some exceptions. This is quite annoying if you run into issues outside Genie Parser. For example, although pyATS uses the ssh command, it cannot leverage my ssh_config file: pyATS resolves hostnames before providing them to ssh. There is no plan to open-source pyATS. Genie Parser itself has two problems:
  1. The data models used are dependent on the vendor and OS, despite the documentation saying otherwise. For example, the data model used for IPv4 interfaces is different between NX-OS and IOS-XR.
  2. The parsers rely on line-by-line regular expressions to extract data and some Python code as glue. This is fragile and may break silently.
To illustrate the second point, let's assume the output of show ipv4 vrf all interface is:
  Loopback10 is Up, ipv4 protocol is Up
    Vrf is default (vrfid 0x60000000)
    Internet protocol processing disabled
  Loopback30 is Up, ipv4 protocol is Down [VRF in FWD reference]
    Vrf is ran (vrfid 0x0)
    Internet address is 203.0.113.17/32
    MTU is 1500 (1500 is available to IP)
    Helper address is not set
    Directed broadcast forwarding is disabled
    Outgoing access list is not set
    Inbound  common access list is not set, access list is not set
    Proxy ARP is disabled
    ICMP redirects are never sent
    ICMP unreachables are always sent
    ICMP mask replies are never sent
    Table Id is 0x0
Because the regular expression to parse an interface name does not expect the extra data after the interface state, Genie Parser ignores the line starting the definition of Loopback30 and parses the output to this structure:1
{
  "Loopback10": {
    "int_status": "up",
    "oper_status": "up",
    "vrf": "ran",
    "vrf_id": "0x0",
    "ipv4": {
      "203.0.113.17/32": {
        "ip": "203.0.113.17",
        "prefix_length": "32"
      },
      "mtu": 1500,
      "mtu_available": 1500,
      "broadcast_forwarding": "disabled",
      "proxy_arp": "disabled",
      "icmp_redirects": "never sent",
      "icmp_unreachables": "always sent",
      "icmp_replies": "never sent",
      "table_id": "0x0"
    }
  }
}
While this bug is simple to fix, this is an uphill battle. Any existing or future slight variation in the output of a command could trigger another similar undetected bug, despite the extended test coverage. I have reported and fixed several other silent parsing errors: #516, #529, and #530. A more robust alternative would have been to use TextFSM and to trigger a warning when some output is not recognized, as Batfish, a configuration analysis tool, does. In the future, we should rely on YANG for data extraction, but it is currently not widely supported. SNMP is still a valid possibility, but much information is not accessible through this protocol. In the meantime, I would advise you to use Genie Parser only with caution.
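For illustration, here is how the TextFSM approach could look, with a hypothetical template covering only the interface line: any unexpected non-indented line hits the Error rule and raises textfsm.TextFSMError instead of being silently skipped. This is a sketch, not a drop-in replacement for the Genie parser.
import io
import textfsm

# Hypothetical template: interface lines are recorded, indented detail lines
# are ignored, and anything else aborts parsing with textfsm.TextFSMError.
TEMPLATE = r"""Value INTERFACE (\S+)
Value STATE (\S+)

Start
  ^${INTERFACE} is ${STATE}, ipv4 protocol is .* -> Record
  ^\s+.* -> Next
  ^. -> Error
"""

OUTPUT = """Loopback30 is Up, ipv4 protocol is Down [VRF in FWD reference]
  Vrf is ran (vrfid 0x0)
  Internet address is 203.0.113.17/32
"""

fsm = textfsm.TextFSM(io.StringIO(TEMPLATE))
print(fsm.header)             # ['INTERFACE', 'STATE']
print(fsm.ParseText(OUTPUT))  # [['Loopback30', 'Up']]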

  1. As an exercise, the astute reader is asked to write the code to extract the IPv4 from this structure.

6 September 2021

Vincent Bernat: Switching to the i3 window manager

I have been using the awesome window manager for 10 years. It is a tiling window manager, configurable and extendable with the Lua language. Using a general-purpose programming language to configure every aspect is a double-edged sword. Due to laziness and the apparent difficulty of adapting my configuration (about 3000 lines) to newer releases, I was stuck with the 3.4 version, whose last release is from 2013. It was time for a rewrite. Instead, I switched to the i3 window manager, lured by the possibility of migrating to Wayland and Sway later with minimal pain. Using an embedded interpreter for configuration is not as important to me as it was in the past: it brings both complexity and brittleness.
i3 dual screen setup
Dual screen desktop running i3, Emacs, some terminals, including a Quake console, Firefox, Polybar as the status bar, and Dunst as the notification daemon.
The window manager is only one part of a desktop environment. There are several options for the other components. I am also introducing them in this post.

i3: the window manager i3 aims to be a minimal tiling window manager. Its documentation can be read from top to bottom in less than an hour. i3 organizes windows in a tree. Each non-leaf node contains one or several windows and has an orientation and a layout. This information determines where windows are placed. i3 features three layouts: split, stacking, and tabbed. They are demonstrated in the screenshot below:
Example of layouts
Demonstration of the layouts available in i3. The main container is split horizontally. The first child is split vertically. The second one is tabbed. The last one is stacking.
Tree representation of the previous screenshot
Tree representation of the previous screenshot.
Most of the other tiling window managers, including the awesome window manager, use predefined layouts. They usually feature a large area for the main window and another area divided among the remaining windows. These layouts can be tuned a bit, but you mostly stick to a couple of them. When a new window is added, the behavior is quite predictable. Moreover, you can cycle through the various windows without thinking too much as they are ordered. i3 is more flexible: it can build any layout on the fly, but this can feel overwhelming as you need to visualize the tree in your head. At first, it is not unusual to find yourself with a complex tree containing many useless nested containers. Moreover, you have to navigate windows using directions. It takes some time to get used to. I set up a split layout for Emacs and a few terminals, but most of the other workspaces use a tabbed layout. I don't use the stacking layout. You can find many scripts trying to emulate other tiling window managers, but I tried to keep my setup free of them to get a chance to familiarize myself with i3. i3 can also save and restore layouts, which is quite a powerful feature. My configuration is quite similar to the default one and has less than 200 lines.
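To help visualize the tree, here is a minimal sketch using the i3ipc-python library (the same library used by the i3 companion described below); it dumps the layout and name of each container under the focused workspace. It is only an inspection helper, not part of my configuration.
from i3ipc import Connection

def dump(node, depth=0):
    # Print the layout and name of each container, indenting children.
    print("  " * depth + f"{node.layout}: {node.name or '(container)'}")
    for child in node.nodes:
        dump(child, depth + 1)

i3 = Connection()
focused = i3.get_tree().find_focused()
if focused is not None:
    dump(focused.workspace())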

i3 companion: the missing bits i3's philosophy is to keep a minimal core and let the user implement missing features using the IPC protocol:
Do not add further complexity when it can be avoided. We are generally happy with the feature set of i3 and instead focus on fixing bugs and maintaining it for stability. New features will therefore only be considered if the benefit outweighs the additional complexity, and we encourage users to implement features using the IPC whenever possible. Introduction to the i3 window manager
While this is not as powerful as an embedded language, it is enough for many cases. Moreover, as high-level features may be opinionated, delegating them to small, loosely coupled pieces of code keeps them more maintainable. Libraries exist for this purpose in several languages. Users have published many scripts to extend i3: automatic layout and window promotion to mimic the behavior of other tiling window managers, window swallowing to put a new app on top of the terminal launching it, and cycling between windows with Alt+Tab. Instead of maintaining a script for each feature, I have centralized everything into a single Python process, i3-companion, using asyncio and the i3ipc-python library. Each feature is self-contained in a function. It implements the following components:
make a workspace exclusive to an application
When a workspace contains Emacs or Firefox, I would like other applications to move to another workspace, except for the terminal which is allowed to intrude into any workspace. The workspace_exclusive() function monitors new windows and moves them if needed to an empty workspace or to one with the same application already running.
implement a Quake console
The quake_console() function implements a drop-down console available from any workspace. It can be toggled with a dedicated Mod key binding. This is implemented as a scratchpad window.
back and forth workspace switching on the same output
With the workspace back_and_forth command, we can ask i3 to switch to the previous workspace. However, this feature is not restricted to the current output. I prefer to have one keybinding to switch to the workspace on the next output and one keybinding to switch to the previous workspace on the same output. This behavior is implemented in the previous_workspace() function by keeping a per-output history of the focused workspaces.
create a new empty workspace or move a window to an empty workspace
To create a new empty workspace or move a window to an empty workspace, you have to locate a free slot and use workspace number 4 or move container to workspace number 4. The new_workspace() function finds a free number and uses it as the target workspace (a sketch of this logic follows this list).
restart some services on output change
When adding or removing an output, some actions need to be executed: refresh the wallpaper, restart some components unable to adapt their configuration on their own, etc. i3 triggers an event for this purpose. The output_update() function also takes an extra step to coalesce multiple consecutive events and to check if there is a real change with the low-level library xcffib.
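Here is a rough sketch of the free-slot logic mentioned above, written with the asynchronous i3ipc-python API; it is an illustration, not the actual new_workspace() code.
import asyncio
from i3ipc.aio import Connection

async def new_workspace(move_window: bool = False) -> None:
    """Switch to, or move the focused window to, the first unused workspace number."""
    i3 = await Connection().connect()
    used = {workspace.num for workspace in await i3.get_workspaces()}
    number = next(n for n in range(1, 100) if n not in used)
    verb = "move container to workspace number" if move_window else "workspace number"
    await i3.command(f"{verb} {number}")

if __name__ == "__main__":
    asyncio.run(new_workspace())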
I will detail the other features as this post goes on. On the technical side, each function is decorated with the events it should react to:
@on(CommandEvent("previous-workspace"), I3Event.WORKSPACE_FOCUS)
async def previous_workspace(i3, event):
    """Go to previous workspace on the same output."""
The CommandEvent() event class is my way to send a command to the companion, using either i3-msg -t send_tick or binding a key to a nop command. The latter is used to avoid spawning a shell and an i3-msg process just to send a message. The companion listens to binding events and checks whether this is a nop command.
bindsym $mod+Tab nop "previous-workspace"
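To give an idea of how a companion process can receive these commands, here is a stripped-down sketch using i3ipc-python: it subscribes to binding and tick events and extracts the command string. This is a simplified illustration, not the actual i3-companion code.
import asyncio
from i3ipc.aio import Connection

async def main() -> None:
    i3 = await Connection().connect()

    async def on_binding(i3, event) -> None:
        # Triggered by: bindsym $mod+Tab nop "previous-workspace"
        command = event.binding.command
        if command.startswith("nop "):
            print("command from binding:", command[4:].strip('"'))

    async def on_tick(i3, event) -> None:
        # Triggered by: i3-msg -t send_tick "previous-workspace"
        if not event.first:
            print("command from tick:", event.payload)

    i3.on("binding", on_binding)
    i3.on("tick", on_tick)
    await i3.main()

asyncio.run(main())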
There are other decorators to avoid code duplication: @debounce() to coalesce multiple consecutive calls, @static() to define a static variable, and @retry() to retry a function on failure. The whole script is a bit more than 1000 lines. I think this is worth a read as I am quite happy with the result.

dunst: the notification daemon Unlike the awesome window manager, i3 does not come with a built-in notification system. Dunst is a lightweight notification daemon. I am running a modified version with HiDPI support for X11 and recursive icon lookup. The i3 companion has a helper function, notify(), to send notifications using DBus. container_info() and workspace_info() use it to display information about the container or the tree for a workspace.
Notification showing i3 tree for a workspace
Notification showing i3's tree for a workspace

polybar: the status bar i3 bundles i3bar, a versatile status bar, but I have opted for Polybar. A wrapper script runs one instance for each monitor. The first module is the built-in support for i3 workspaces. So that I don't have to remember which application is running in a workspace, the i3 companion renames workspaces to include an icon for each application. This is done in the workspace_rename() function. The icons are from the Font Awesome project. I maintain a mapping between applications and icons. This is a bit cumbersome, but it looks great.
i3 workspaces in Polybar
i3 workspaces in Polybar
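The renaming itself boils down to walking the tree and issuing rename commands. Here is a hypothetical sketch with i3ipc-python, where APPLICATION_ICONS is a made-up mapping from window class to Font Awesome glyph, not my actual mapping:
from i3ipc import Connection

APPLICATION_ICONS = {"firefox": "\uf269", "emacs": "\uf121"}  # hypothetical mapping
DEFAULT_ICON = "\uf2d0"

i3 = Connection()
for workspace in i3.get_tree().workspaces():
    icons = {
        APPLICATION_ICONS.get((leaf.window_class or "").lower(), DEFAULT_ICON)
        for leaf in workspace.leaves()
    }
    new_name = f"{workspace.num}: {' '.join(sorted(icons))}" if icons else str(workspace.num)
    if new_name != workspace.name:
        i3.command(f'rename workspace "{workspace.name}" to "{new_name}"')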
For CPU, memory, brightness, battery, disk, and audio volume, I am relying on the built-in modules. Polybar's wrapper script generates the list of filesystems to monitor, and they are only displayed when available space is low. The battery widget turns red and blinks slowly when running out of power. Check my Polybar configuration for more details.
Various modules for Polybar
Polybar displaying various information: CPU usage, memory usage, screen brightness, battery status, Bluetooth status (with a connected headset), network status (connected to a wireless network and to a VPN), notification status, and speaker volume.
For Bluetooth, network, and notification statuses, I am using Polybar's ipc module: the next version of Polybar can receive arbitrary text on an IPC socket. The module is defined with a single hook, executed at startup to restore the latest status.
[module/network]
type = custom/ipc
hook-0 = cat $XDG_RUNTIME_DIR/i3/network.txt 2> /dev/null
initial = 1
It can be updated with polybar-msg action "#network.send.XXXX". In the i3 companion, the @polybar() decorator takes the string returned by a function and pushes the update through the IPC socket. The i3 companion reacts to DBus signals to update the Bluetooth and network icons. The @on() decorator accepts a DBusSignal() object:
@on(
    StartEvent,
    DBusSignal(
        path="/org/bluez",
        interface="org.freedesktop.DBus.Properties",
        member="PropertiesChanged",
        signature="sa{sv}as",
        onlyif=lambda args: (
            args[0] == "org.bluez.Device1"
            and "Connected" in args[1]
            or args[0] == "org.bluez.Adapter1"
            and "Powered" in args[1]
        ),
    ),
)
@retry(2)
@debounce(0.2)
@polybar("bluetooth")
async def bluetooth_status(i3, event, *args):
    """Update bluetooth status for Polybar."""
The middle of the bar is occupied by the date and a weather forecast. The latter also uses the IPC mechanism, but the source is a Python script triggered by a timer.
Date and weather in Polybar
Current date and weather forecast for the day in Polybar. The data is retrieved with the OpenWeather API.
I don't use the system tray integrated with Polybar. The embedded icons usually look horrible and they all behave differently. A few years back, GNOME removed the system tray. Most of the problems are fixed by the DBus-based Status Notifier Item protocol, also known as Application Indicators or Ayatana Indicators for GNOME. However, Polybar does not support this protocol. In the i3 companion, the implementation of the Bluetooth and network icons, including displaying notifications on change, takes about 200 lines. I got to learn a bit about how DBus works and I get exactly the information I want.

picom: the compositor I like having slightly transparent backgrounds for terminals and to reduce the opacity of unfocused windows. This requires a compositor.1 picom is a lightweight compositor. It works well for me, but it may need some tweaking depending on your graphics card.2 Unlike the awesome window manager, i3 does not handle transparency, so the compositor has to decide the opacity of each window by itself. Check my configuration for details.

systemd: the service manager I use systemd to start i3 and the various services around it. My xsession script only sets some environment variables and lets systemd handle everything else. Have a look at this article from Michał Góral for the rationale. Notably, each component can be easily restarted and their logs are not mangled inside the ~/.xsession-errors file.3 I am using a two-stage setup: i3.service depends on xsession.target to start services before i3:
[Unit]
Description=X session
BindsTo=graphical-session.target
Wants=autorandr.service
Wants=dunst.socket
Wants=inputplug.service
Wants=picom.service
Wants=pulseaudio.socket
Wants=policykit-agent.service
Wants=redshift.service
Wants=spotify-clean.timer
Wants=ssh-agent.service
Wants=xiccd.service
Wants=xsettingsd.service
Wants=xss-lock.service
Then, i3 executes the second stage by invoking the i3-session.target:
[Unit]
Description=i3 session
BindsTo=graphical-session.target
Wants=wallpaper.service
Wants=wallpaper.timer
Wants=polybar-weather.service
Wants=polybar-weather.timer
Wants=polybar.service
Wants=i3-companion.service
Wants=misc-x.service
Have a look at my configuration files for more details.

rofi: the application launcher Rofi is an application launcher. Its appearance can be customized through a CSS-like language and it comes with several themes. Have a look at my configuration for mine.
Rofi as an application launcher
Rofi as an application launcher
It can also act as a generic menu application. I have a script to control a media player and another one to select the wifi network. It is quite a flexible application.
Rofi as a wifi network selector
Rofi to select a wireless network

xss-lock and i3lock: the screen locker i3lock is a simple screen locker. xss-lock invokes it reliably on inactivity or before a system suspend. For inactivity, it uses the XScreenSaver events. The delay is configured using the xset s command. The locker can be invoked immediately with xset s activate. X11 applications know how to prevent the screen saver from running. I have also developed a small dimmer application that is executed 20 seconds before the locker to give me a chance to move the mouse if I am not away.4 Have a look at my configuration script.
Demonstration of xss-lock, xss-dimmer and i3lock at 4× speed.

The remaining components
  • autorandr is a tool to detect the connected displays, match them against a set of profiles, and configure them with xrandr.
  • inputplug executes a script each time a mouse or a keyboard is plugged in. This is quite useful to load the appropriate keyboard map. See my configuration.
  • xsettingsd provides settings to X11 applications, not unlike xrdb, but it notifies applications of changes. The main use is to configure the Gtk and DPI settings. See my article on HiDPI support on Linux with X11.
  • Redshift adjusts the color temperature of the screen according to the time of day.
  • maim is a utility to take screenshots. I use Prt Scn to trigger a screenshot of a window or a specific area and Mod+Prt Scn to capture the whole desktop to a file. Check the helper script for details.
  • I have a collection of wallpapers I rotate every hour. A script selects them using advanced machine learning algorithms and stitches them together on multi-screen setups. The selected wallpaper is reused by i3lock.

  1. Apart from the eye candy, a compositor also helps to get tear-free video playback.
  2. My configuration works with both Haswell (2014) and Whiskey Lake (2018) Intel GPUs. It also works with an AMD GPU based on the Polaris chipset (2017).
  3. You cannot manage two different displays this way, e.g. :0 and :1. In the first implementation, I did try to parametrize each service with the associated display, but this is useless: there is only one DBus user session and many services rely on it. For example, you cannot run two notification daemons.
  4. I have only discovered later that XSecureLock ships such a dimmer with a similar implementation. But mine has a cool countdown!

23 August 2021

Vincent Bernat: ThinkPad X1 Carbon (Gen 7): 2 years later

Two years ago, I replaced my ThinkPad X1 Carbon 2014 with the latest generation. The new configuration embeds an Intel Core i7-8565U, 16 GiB of RAM, a 1 TiB NVMe disk, and a WQHD display (2560×1440). I did not ask for a WWAN card. I think it is easier and more reliable to use the wifi hotspot feature of a phone instead: no unreliable firmware and unsupported drivers.1 Here is my opinion on this model.
ThinkPad X1 Carbon 7th Gen with the lid closed
ThinkPad X1 Carbon with its lid closed
While the second generation got a very odd keyboard, this one got a classic one with a full row of function keys. I don't know if my model was defective, but the keyboard skips a keypress from time to time. I have got used to it, but the space key still has a hard time registering when I hit it with my right thumb. The key travel is also shorter and it is less comfortable to type on than the 2014 version. The trackpoint2 works well. The physical buttons are a welcome addition. I am only using the trackpad for scrolling with the two-finger gesture.
Keyboard of the X1 Carbon 7th Gen
Keyboard with an ANSI QWERTY layout (aka English EU for Lenovo). The LEDs on the speaker and microphone keys work out of the box on Linux.
The screen does not suffer from ghosting effects like the one in my previous ThinkPad. To avoid leaving marks on the screen, I use a piece of cloth on top of the keyboard when closing the laptop. The 720p webcam has a built-in mechanical cover. Its quality is not great, but it does the job.
X1 Carbon 7th Gen
ThinkPad X1 Carbon with its lid opened
Battery life effortlessly reaches about 8 hours on a full charge. This is the main reason this laptop feels like a solid upgrade compared to the previous one: no need to carry a power brick during the day. This is the first Lenovo model with a sound card requiring the Sound Open Firmware. Without the appropriate firmware and the related userspace components (ALSA UCM and PulseAudio), the microphone does not work. If everything is set up correctly, the speakers produce a very decent sound, better than the 2014 model. It should now work out-of-the-box with Debian Bullseye. Just install the firmware-sof-signed package.3 The BIOS can be updated directly from Linux, thanks to the Linux Vendor Firmware Service. I was using the ThinkPad USB-C Dock Gen 2 as a docking station. Everything works out-of-the-box. However, from time to time, I had trouble reliably getting an image on the two screens. I was using a couple of 10-year-old LG monitors with DVI connectors, so I relied on DP→HDMI→DVI adapters. This may have been the source of some of the problems.
ThinkPad USB-C Dock Gen 2
ThinkPad USB-C Dock Gen 2 with network, keyboard, mouse, power and two screens plugged

In summary, this is a fine laptop plagued with a bad keyboard. I did appreciate the good battery life and the fact that there is still one HDMI port and two USB-A ports, so I didn't need to travel with dongles. I was disappointed by how small the performance gap was with my 5-year-older laptop. I am not sure I will get a Lenovo for the next one: HiDPI screens are mostly unavailable and current prices are high. There are other possibilities; until then, I am using my previous ThinkPad.

  1. The seventh generation moved from Sierra Wireless to Fibocom. While Linux support for modems is good, thanks to ModemManager, they are usually driven over the USB bus. The Fibocom L850-GL can use either the USB bus or the PCI bus but Lenovo's BIOS blacklists the former. It took quite some time to find how to switch it to USB after booting. A PCI driver is in progress.
  2. I did replace the trackpoint with a low-profile one from SaotoTech.
  3. The package is in non-free, despite being open-source: many platforms require the firmware to be signed by Intel.

1 July 2021

Vincent Bernat: Upgrading my desktop PC

I built my current desktop PC in 2014. A second SSD was added in 2015. The motherboard and the power supply were replaced after a fault1 in 2016. The memory was upgraded in 2018. A discrete AMD GPU was installed in 2019 to drive two 4K screens. An NVMe disk was added earlier this year to further increase storage performance. This is a testament to the durability of a desktop PC compared to a laptop: it can evolve and you can keep it for a long time. While fine for most usage, the CPU started to become a bottleneck during video conferences.2 So, it was set for an upgrade. The table below summarizes the change. This update cost me about 800 €.
              Before                             After
CPU           Intel i5-4670K @ 3.4 GHz           AMD Ryzen 5 5600X @ 3.7 GHz
CPU fan       Zalman CNPS9900                    Noctua NH-U12S
Motherboard   Asus Z97-PRO Gamer                 Asus TUF Gaming B550-PLUS
RAM           2×8 GB + 2×4 GB DDR3 @ 1.6 GHz     2×16 GB DDR4 @ 3.6 GHz
GPU           Asus Radeon PH RX 550 4G M7 (unchanged)
Disks         500 GB Crucial P2 NVMe, 256 GB Samsung SSD 850, 256 GB Samsung SSD 840 (unchanged)
PSU           be quiet! Pure Power CM L8 @ 530 W (unchanged)
Case          Antec P100 (unchanged)
According to some benchmarks, the new CPU should be 4× faster when all cores are used and 1.5× faster for a single-threaded workload. Compiling an arbitrary3 kernel provides a 3× speedup. Before:
$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ
  0    0      0    0 0:0:0:0          yes 3800.0000 800.0000
  1    0      0    1 1:1:1:0          yes 3800.0000 800.0000
  2    0      0    2 2:2:2:0          yes 3800.0000 800.0000
  3    0      0    3 3:3:3:0          yes 3800.0000 800.0000
$ CCACHE_DISABLE=1 =time -f '  %E' make -j$(nproc)
[…]
  OBJCOPY arch/x86/boot/vmlinux.bin
  AS      arch/x86/boot/header.o
  LD      arch/x86/boot/setup.elf
  OBJCOPY arch/x86/boot/setup.bin
  BUILD   arch/x86/boot/bzImage
Kernel: arch/x86/boot/bzImage is ready  (#1)
  4:54.32
After:
$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ    MINMHZ
  0    0      0    0 0:0:0:0          yes 5210.3511 2200.0000
  1    0      0    1 1:1:1:0          yes 4650.2920 2200.0000
  2    0      0    2 2:2:2:0          yes 5210.3511 2200.0000
  3    0      0    3 3:3:3:0          yes 5073.0459 2200.0000
  4    0      0    4 4:4:4:0          yes 4932.1279 2200.0000
  5    0      0    5 5:5:5:0          yes 4791.2100 2200.0000
  6    0      0    0 0:0:0:0          yes 5210.3511 2200.0000
  7    0      0    1 1:1:1:0          yes 4650.2920 2200.0000
  8    0      0    2 2:2:2:0          yes 5210.3511 2200.0000
  9    0      0    3 3:3:3:0          yes 5073.0459 2200.0000
 10    0      0    4 4:4:4:0          yes 4932.1279 2200.0000
 11    0      0    5 5:5:5:0          yes 4791.2100 2200.0000
$ CCACHE_DISABLE=1 =time -f '  %E' make -j$(nproc)
[…]
  OBJCOPY arch/x86/boot/vmlinux.bin
  AS      arch/x86/boot/header.o
  LD      arch/x86/boot/setup.elf
  OBJCOPY arch/x86/boot/setup.bin
  BUILD   arch/x86/boot/bzImage
Kernel: arch/x86/boot/bzImage is ready  (#1)
  1:40.18
Here we go for another seven years!

  1. The original power supply was from an older configuration. It suddenly became unable to reliably start the PC. The motherboard got replaced as it was the first suspect: without load, the power supply was working correctly.
  2. On Linux, many programs are unable to leverage hardware acceleration. This is a pity. On a laptop, this can also drain the battery pretty fast.
  3. The kernel is configured with make defconfig on commit 15fae3410f1d.

10 June 2021

Petter Reinholdtsen: Nikita version 0.6 released - free software archive API server

I am very pleased to be able to share with you the announcement of a new version of the archiving system Nikita published by its lead developer Thomas Sødring:
It is with great pleasure that we can announce a new release of nikita. Version 0.6 (https://gitlab.com/OsloMet-ABI/nikita-noark5-core). This release makes new record keeping functionality available. This really is a maturity release, both in terms of functionality and code. Considerable effort has gone into refactoring the codebase and simplifying the code. Notable changes for this release include:
  • Significantly improved OData parsing
  • Support for business specific metadata and national identifiers
  • Continued implementation of domain model and endpoints
  • Improved testing
  • Ability to export and import from arkivstruktur.xml
We are currently in the process of reaching an agreement with an archive institution to publish their picture archive using nikita with business specific metadata and we hope that we can share this with you soon. This is an interesting project as it allows the organisation to bring an older picture archive back to life while using the original metadata values stored as business specific metadata. Combined with OData, this means the scope and use of the archive is significantly increased and will showcase both the flexibility and power of Noark. I really think we are approaching a version 1.0 of nikita, even though there is still a lot of work to be done. The notable work at the moment is to implement access-control and full text indexing of documents. My sincere thanks to everyone who has contributed to this release! - Thomas Release 0.6 2021-06-10 (d1ba5fc7e8bad0cfdce45ac20354b19d10ebbc7b)
  • Refactor metadata entity search
  • Remove redundant security configuration
  • Make OpenAPI documentation work
  • Change database structure / inheritance model to a more sensible approach
  • Make it possible to move entities around the fonds structure
  • Implemented a number of missing endpoints
  • Make sure yml files are in sync
  • Implemented/finalised storing and use of
    • Business Specific Metadata
    • Norwegian National Identifiers
    • Cross Reference
    • Keyword
    • StorageLocation
    • Author
    • Screening for relevant objects
    • ChangeLog
    • EventLog
  • Make generation of updated docker image part of successful CI pipeline
  • Implement pagination for all list requests
    • Refactor code to support lists
    • Refactor code for readability
    • Standardise the controller/service code
  • Finalise File->CaseFile expansion and Record->registryEntry/recordNote expansion
  • Improved Continuous Integration (CI) approach via gitlab
  • Changed conversion approach to generate tagged PDF documents
  • Updated dependencies
    • For security reasons
    • Brought codebase to spring-boot version 2.5.0
    • Remove import of unnecessary dependencies
    • Remove non-used metrics classes
  • Added new analysis to CI including
  • Implemented storing of Keyword
  • Implemented storing of Screening and ScreeningMetadata
  • Improved OData support
    • Better support for inheritance in queries where applicable
    • Brought in more OData tests
    • Improved OData/hibernate understanding of queries
    • Implement $count, $orderby
    • Finalise $top and $skip
    • Make sure & is used between query parameters
  • Improved Testing in codebase
    • A new approach for integration tests to make test more readable
    • Introduce tests in parallel with code development for TDD approach
    • Remove test that required particular access to storage
  • Implement case-handling process from received email to case-handler
    • Develop required GUI elements (digital postroom from email)
    • Introduced leader, quality control and postroom roles
  • Make PUT requests return 200 OK not 201 CREATED
  • Make DELETE requests return 204 NO CONTENT not 200 OK
  • Replaced 'oppdatert*' with 'endret*' everywhere to match latest spec
  • Upgrade Gitlab CI to use python > 3 for CI scripts
  • Bug fixes
    • Fix missing ALLOW
    • Fix reading of objects from jar file during start-up
    • Reduce the number of warnings in the codebase
    • Fix delete problems
    • Make better use of cascade for "leaf" objects
    • Add missing annotations where relevant
    • Remove the use of ETAG for delete
    • Fix missing/wrong/broken rels discovered by runtest
    • Drop unofficial convertFil (konverterFil) end point
    • Fix regex problem for dateTime
    • Fix multiple static analysis issues discovered by coverity
    • Fix proxy problem when looking for object class names
    • Add many missing translated Norwegian to English (internal) attribute/entity names
    • Change UUID generation approach to allow code also set a value
    • Fix problem with Part/PartParson
    • Fix problem with empty OData search results
    • Fix metadata entity domain problem
  • General Improvements
    • Makes future refactoring easier as coupling is reduced
    • Allow some constant variables to be set from property file
    • Refactor code to make reflection work better across codebase
    • Reduce the number of @Service layer classes used in @Controller classes
    • Be more consistent on naming of similar variable types
    • Start printing rels/href if they are applicable
    • Cleaner / standardised approach to deleting objects
    • Avoid concatenation when using StringBuilder
    • Consolidate code to avoid duplication
    • Tidy formatting for a more consistent reading style across similar class files
    • Make throw a log.error message not an log.info message
    • Make throw print the log value rather than printing in multiple places
    • Add some missing pronom codes
    • Fix time formatting issue in Gitlab CI
    • Remove stale / unused code
    • Use only UUID datatype rather than combination String/UUID for systemID
    • Mark variables final and @NotNull where relevant to indicate intention
  • Change Date values to DateTime to maintain compliance with Noark 5 standard
  • Domain model improvements using Hypersistence Optimizer
    • Move @Transactional from class to methods to avoid borrowing the JDBC Connection unnecessarily
    • Fix OneToOne performance issues
    • Fix ManyToMany performance issues
    • Add missing bidirectional synchronization support
    • Fix ManyToMany performance issue
  • Make List and Set use the final keyword to avoid potential problems during update operations
  • Changed internal URLs, replaced "hateoas-api" with "api".
  • Implemented storing of Precedence.
  • Corrected handling of screening.
  • Corrected _links collection returned for list of mixed entity types to match the specific entity.
  • Improved several internal structures.
If a free and open standardized archiving API sounds interesting to you, please contact us on IRC (#nikita on irc.oftc.net) or email (nikita-noark mailing list). As usual, if you use Bitcoin and want to show your support of my activities, please send Bitcoin donations to my address 15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b.

Vincent Bernat: Serving WebP & AVIF images with Nginx

WebP and AVIF are two image formats for the web. They aim to produce smaller files than JPEG and PNG. They both support lossy and lossless compression, as well as alpha transparency. WebP was developed by Google and is a derivative of the VP8 video format.1 It is supported by most browsers. AVIF uses the newer AV1 video format to achieve better results. It is supported by Chromium-based browsers and has experimental support in Firefox.2

Converting and optimizing images For this blog, I am using the following shell snippets to convert and optimize JPEG and PNG images. Skip to the next section if you are only interested in the Nginx setup.

JPEG images JPEG images are converted to WebP using cwebp.
find media/images -type f -name '*.jpg' -print0 \
  | xargs -0n1 -P$(nproc) -i \
      cwebp -q 84 -af '{}' -o '{}'.webp
They are converted to AVIF using avifenc from libavif:
find media/images -type f -name '*.jpg' -print0 \
  | xargs -0n1 -P$(nproc) -i \
      avifenc --codec aom --yuv 420 --min 20 --max 25 '{}' '{}'.avif
Then, they are optimized using jpegoptim built with Mozilla's improved JPEG encoder, via Nix. This is one reason I love Nix.
jpegoptim=$(nix-build --no-out-link \
      -E 'with (import <nixpkgs>{}); jpegoptim.override { libjpeg = mozjpeg; }')
find media/images -type f -name '*.jpg' -print0 \
  | sort -z \
  | xargs -0n10 -P$(nproc) \
      ${jpegoptim}/bin/jpegoptim --max=84 --all-progressive --strip-all

PNG images PNG images are down-sampled to 8-bit RGBA-palette using pngquant. The conversion reduces file sizes significantly while being mostly invisible.
find media/images -type f -name '*.png' -print0 \
  | sort -z \
  | xargs -0n10 -P$(nproc) \
      pngquant --skip-if-larger --strip \
               --quiet --ext .png --force
Then, they are converted to WebP with cwebp in lossless mode:
find media/images -type f -name '*.png' -print0 \
  | xargs -0n1 -P$(nproc) -i \
      cwebp -z 8 '{}' -o '{}'.webp
No conversion is done to AVIF: lossless compression is not as efficient as pngquant and lossy compression is only marginally better than what I get with WebP.

Keeping only the smallest files I am only keeping WebP and AVIF images if they are at least 10% smaller than the original format: decoding is usually faster for JPEG and PNG; and JPEG images can be decoded progressively.3
for f in media/images/**/*.{webp,avif}; do
  orig=$(stat --format %s ${f%.*})
  new=$(stat --format %s $f)
  (( orig*0.90 > new )) || rm $f
done
I only keep AVIF images if they are smaller than WebP.
for f in media/images/**/*.avif; do
  [[ -f ${f%.*}.webp ]] || continue
  orig=$(stat --format %s ${f%.*}.webp)
  new=$(stat --format %s $f)
  (( $orig > $new )) || rm $f
done
We can compare how many images are kept when converted to WebP or AVIF:
printf "     %10s %10s %10s\n" Original WebP AVIF
for format in png jpg; do
  printf " ${format:u}  %10s %10s %10s\n" \
    $(find media/images -name "*.$format" | wc -l) \
    $(find media/images -name "*.$format.webp" | wc -l) \
    $(find media/images -name "*.$format.avif" | wc -l)
done
AVIF is better than MozJPEG for most JPEG files while WebP beats MozJPEG only for one file out of two:
       Original       WebP       AVIF
 PNG         64         47          0
 JPG         83         40         74

Further reading I didn't detail my choices for quality parameters and there is not much science in it. Here are two resources providing more insight on AVIF:

Serving WebP & AVIF with Nginx To serve WebP and AVIF images, there are two possibilities:
  1. use <picture> to let the browser pick the format it supports, or
  2. use content negotiation to let the server send the best-supported format.
I use the second approach. It relies on inspecting the Accept HTTP header in the request. For Chrome, it looks like this:
Accept: image/avif,image/webp,image/apng,image/*,*/*;q=0.8
I configure Nginx to serve the AVIF image, then the WebP image, and to fall back to the original JPEG/PNG image, depending on what the browser advertises:4
http {
  map $http_accept $webp_suffix {
    default        "";
    "~image/webp"  ".webp";
  }
  map $http_accept $avif_suffix {
    default        "";
    "~image/avif"  ".avif";
  }
}
server {
  # […]
  location ~ ^/images/.*\.(png|jpe?g)$ {
    add_header Vary Accept;
    try_files $uri$avif_suffix$webp_suffix $uri$avif_suffix $uri$webp_suffix $uri =404;
  }
}
For example, let's suppose the browser requests /images/ont-box-orange@2x.jpg. If it supports WebP but not AVIF, $webp_suffix is set to .webp while $avif_suffix is set to the empty string. The server tries to serve the first existing file in this list:
  • /images/ont-box-orange@2x.jpg.webp
  • /images/ont-box-orange@2x.jpg
  • /images/ont-box-orange@2x.jpg.webp
  • /images/ont-box-orange@2x.jpg
If the browser supports both AVIF and WebP, Nginx walks the following list:
  • /images/ont-box-orange@2x.jpg.avif.webp (it never exists)
  • /images/ont-box-orange@2x.jpg.avif
  • /images/ont-box-orange@2x.jpg.webp
  • /images/ont-box-orange@2x.jpg
Eugene Lazutkin explains in more detail how this works. I have only presented a variation of his setup supporting both WebP and AVIF.
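A quick way to check the negotiation is to replay the requests with different Accept headers, for example with Python and the requests library; the URL below is hypothetical.
import requests

URL = "https://example.net/images/ont-box-orange@2x.jpg"  # hypothetical URL
for accept in ("image/avif,image/webp,*/*", "image/webp,*/*", "*/*"):
    r = requests.get(URL, headers={"Accept": accept})
    print(f"{accept:28} -> {r.headers.get('Content-Type')}, "
          f"{len(r.content)} bytes, Vary: {r.headers.get('Vary')}")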

  1. VP8 is only used for lossy compression. Lossless compression uses an unrelated format.
  2. Firefox support was scheduled for Firefox 86 but because of the lack of proper color space support, it is still not enabled by default.
  3. Progressive decoding is not planned for WebP but could be implemented using low-quality thumbnail images for AVIF. See this issue for a discussion.
  4. The Vary header ensures an intermediary cache (a proxy or a CDN) checks the Accept header before using a cached response. Internet Explorer has trouble with this header and may not be able to cache the resource properly. There is a workaround but Internet Explorer's market share is now so small that it is pointless to implement it.

25 May 2021

Vincent Bernat: Jerikan+Ansible: a configuration management system for network

There are many resources for network automation with Ansible. Most of them only expose the first steps or limit themselves to a narrow scope. They give no clue on how to expand from that. Real network environments may be large, versatile, heterogeneous, and filled with exceptions. The lack of real-world examples for Ansible deployments, unlike Puppet and SaltStack, leads many teams to build brittle and incomplete automation solutions. We have released our attempt to tackle this problem under an open-source license. Here is a quick demo to configure a new peering:
This work is the collective effort of Cédric Hascoët, Jean-Christophe Legatte, Loïc Pailhas, Sébastien Hurtel, Tchadel Icard, and Vincent Bernat. We are the network team of Blade, a French company operating Shadow, a cloud-computing product. In May 2021, our company was bought by Octave Klaba and the infrastructure is being transferred to OVHcloud, saving Shadow as a product, but making our team redundant. Our network was around 800 devices, spanning over 10 datacenters with more than 2.5 Tbps of available egress bandwidth. The released material is therefore a substantial example of managing a medium-scale network using Ansible. We have left out the handling of our legacy datacenters to make the final result more readable while keeping enough material to not turn it into a trivial example.

Jerikan The first component is Jerikan. As input, it takes a list of devices, configuration data, templates, and validation scripts. It generates a set of configuration files for each device. Ansible could cover this task, but it has the following limitations:
  • it is slow;
  • errors are difficult to debug;1 and
  • the hierarchy to look up a variable is rigid.
Jerikan inputs and outputs
Jerikan inputs and outputs
If you want to follow the examples, you only need to have Docker and Docker Compose installed. Clone the repository and you are ready!

Source of truth We use YAML files, versioned with Git, as the single source of truth instead of using a database, like NetBox, or a mix of a database and text files. This provides many advantages:
  • anyone can use their preferred text editor;
  • the team prepares changes in branches;
  • the team reviews changes using merge requests;
  • the merge requests expose the changes to the generated configuration files;
  • rollback to a previous state is easy; and
  • it is fast.
The first file is devices.yaml. It contains the device list. The second file is classifier.yaml. It defines a scope for each device. A scope is a set of keys and values. It is used in templates and to look up data associated with a device.
$ ./run-jerikan scope to1-p1.sk1.blade-group.net
continent: apac
environment: prod
groups:
- tor
- tor-bgp
- tor-bgp-compute
host: to1-p1.sk1
location: sk1
member: '1'
model: dell-s4048
os: cumulus
pod: '1'
shorthost: to1-p1
The device name is matched against a list of regular expressions and the scope is extended by the result of each match. For to1-p1.sk1.blade-group.net, the following subset of classifier.yaml defines its scope:
matchers:
  - '^(([^.]*)\..*)\.blade-group\.net':
      environment: prod
      host: '\1'
      shorthost: '\2'
  - '\.(sk1)\.':
      location: '\1'
      continent: apac
  - '^to([12])-[as]?p(\d+)\.':
      member: '\1'
      pod: '\2'
  - '^to[12]-p\d+\.':
      groups:
        - tor
        - tor-bgp
        - tor-bgp-compute
  - '^to[12]-(p|ap)\d+\.sk1\.':
      os: cumulus
      model: dell-s4048
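The matching logic can be illustrated with a short Python sketch, not Jerikan's actual code: each regular expression matching the device name contributes its values to the scope, with backreferences expanded from the match.
import re

MATCHERS = [  # simplified subset of classifier.yaml
    (r'^(([^.]*)\..*)\.blade-group\.net', {"environment": "prod", "host": r"\1", "shorthost": r"\2"}),
    (r'\.(sk1)\.', {"location": r"\1", "continent": "apac"}),
    (r'^to([12])-[as]?p(\d+)\.', {"member": r"\1", "pod": r"\2"}),
]

def classify(device: str) -> dict:
    scope = {}
    for pattern, values in MATCHERS:
        match = re.search(pattern, device)
        if not match:
            continue
        for key, value in values.items():
            # Expand \1, \2, … from the current match; plain strings stay as-is.
            scope.setdefault(key, match.expand(value))
    return scope

print(classify("to1-p1.sk1.blade-group.net"))
# {'environment': 'prod', 'host': 'to1-p1.sk1', 'shorthost': 'to1-p1',
#  'location': 'sk1', 'continent': 'apac', 'member': '1', 'pod': '1'}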
The third file is searchpaths.py. It describes which directories to search for a variable. A Python function provides a list of paths to look up in data/ for a given scope. Here is a simplified version:2
def searchpaths(scope):
    paths = [
        "host/ scope[location] / scope[shorthost] ",
        "location/ scope[location] ",
        "os/ scope[os] - scope[model] ",
        "os/ scope[os] ",
        'common'
    ]
    for idx in range(len(paths)):
        try:
            paths[idx] = paths[idx].format(scope=scope)
        except KeyError:
            paths[idx] = None
    return [path for path in paths if path]
With this definition, the data for to1-p1.sk1.blade-group.net is looked up in the following paths:
$ ./run-jerikan scope to1-p1.sk1.blade-group.net
[…]
Search paths:
  host/sk1/to1-p1
  location/sk1
  os/cumulus-dell-s4048
  os/cumulus
  common
Variables are scoped using a namespace that should be specified when doing a lookup. We use the following ones:
  • system for accounts, DNS, syslog servers,
  • topology for ports, interfaces, IP addresses, subnets,
  • bgp for BGP configuration
  • build for templates and validation scripts
  • apps for application variables
When looking up a variable in a given namespace, Jerikan looks for a YAML file named after the namespace in each directory in the search paths. For example, if we look up a variable for to1-p1.sk1.blade-group.net in the bgp namespace, the following YAML files are processed: host/sk1/to1-p1/bgp.yaml, location/sk1/bgp.yaml, os/cumulus-dell-s4048/bgp.yaml, os/cumulus/bgp.yaml, and common/bgp.yaml. The search stops at the first match. The schema.yaml file allows us to override this behavior by asking to merge dictionaries and arrays across all matching files. Here is an excerpt of this file for the system namespace:
system:
  users:
    merge: hash
  sampling:
    merge: hash
  ansible-vars:
    merge: hash
  netbox:
    merge: hash
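Ignoring the merge behavior, the first-match lookup can be pictured with this toy sketch, assuming a data/ tree laid out as described; it is not Jerikan's actual code.
from pathlib import Path
import yaml

def lookup(searchpaths: list[str], namespace: str, key: str, root: Path = Path("data")):
    for directory in searchpaths:
        candidate = root / directory / f"{namespace}.yaml"
        if not candidate.exists():
            continue
        data = yaml.safe_load(candidate.read_text()) or {}
        if key in data:
            return data[key]  # stop at the first matching file
    raise KeyError(f"{key} not found in namespace {namespace}")

# Example: lookup(["host/sk1/to1-p1", "location/sk1", "os/cumulus-dell-s4048",
#                  "os/cumulus", "common"], "bgp", "peers")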
The last feature of the source of truth is the ability to use Jinja2 templates for keys and values by prefixing them with "~":
# In data/os/junos/system.yaml
netbox:
  manufacturer: Juniper
  model: "~  model upper  "
# In data/groups/tor-bgp-compute/system.yaml
netbox:
  role: net_tor_gpu_switch
Looking up netbox in the system namespace for to1-p2.ussfo03.blade-group.net yields the following result:
$ ./run-jerikan scope to1-p2.ussfo03.blade-group.net
continent: us
environment: prod
groups:
- tor
- tor-bgp
- tor-bgp-compute
host: to1-p2.ussfo03
location: ussfo03
member: '1'
model: qfx5110-48s
os: junos
pod: '2'
shorthost: to1-p2
[…]
Search paths:
[…]
  groups/tor-bgp-compute
[…]
  os/junos
  common
$ ./run-jerikan lookup to1-p2.ussfo03.blade-group.net system netbox
manufacturer: Juniper
model: QFX5110-48S
role: net_tor_gpu_switch
This also works for structured data:
# In groups/adm-gateway/topology.yaml
interface-rescue:
  address: "~{{ lookup('topology', 'addresses').rescue }}"
  up:
    - "~ip route add default via {{ lookup('topology', 'addresses').rescue|ipaddr('first_usable') }} table rescue"
    - "~ip rule add from {{ lookup('topology', 'addresses').rescue|ipaddr('address') }} table rescue priority 10"
# In groups/adm-gateway-sk1/topology.yaml
interfaces:
  ens1f0: "~{{ lookup('topology', 'interface-rescue') }}"
This yields the following result:
$ ./run-jerikan lookup gateway1.sk1.blade-group.net topology interfaces
[…]
ens1f0:
  address: 121.78.242.10/29
  up:
  - ip route add default via 121.78.242.9 table rescue
  - ip rule add from 121.78.242.10 table rescue priority 10
When putting data in the source of truth, we use the following rules:
  1. Don t repeat yourself.
  2. Put the data in the most specific place without breaking the first rule.
  3. Use templates sparingly, mostly to help with the previous rules.
  4. Restrict the data model to what is needed for your use case.
The first rule is important. For example, when specifying IP addresses for a point-to-point link, only specify one side and deduce the other value in the templates (a sketch of this deduction follows the example below). The last rule means you do not need to mimic a BGP YANG model to specify BGP peers and policies:
peers:
  transit:
    cogent:
      asn: 174
      remote:
        - 38.140.30.233
        - 2001:550:2:B::1F9:1
      specific-import:
        - name: ATT-US
          as-path: ".*7018$"
          lp-delta: 50
  ix-sfmix:
    rs-sfmix:
      monitored: true
      asn: 63055
      remote:
        - 206.197.187.253
        - 206.197.187.254
        - 2001:504:30::ba06:3055:1
        - 2001:504:30::ba06:3055:2
    blizzard:
      asn: 57976
      remote:
        - 206.197.187.42
        - 2001:504:30::ba05:7976:1
      irr: AS-BLIZZARD
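Coming back to the first rule: deducing the remote end of a point-to-point link is a one-liner with Python's ipaddress module, for example inside a custom Jinja2 filter. The helper below is hypothetical, not part of the released code.
import ipaddress

def peer_address(local: str) -> str:
    """Given one side of a point-to-point /31 or /127, return the other side."""
    interface = ipaddress.ip_interface(local)
    network = interface.network
    assert network.num_addresses == 2, "expected a point-to-point prefix"
    other = next(address for address in network if address != interface.ip)
    return f"{other}/{network.prefixlen}"

print(peer_address("10.12.110.1/31"))   # 10.12.110.0/31
print(peer_address("2001:db8::1/127"))  # 2001:db8::/127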

Templates The list of templates to compile for each device is stored in the source of truth, under the build namespace:
$ ./run-jerikan lookup edge1.ussfo03.blade-group.net build templates
data.yaml: data.j2
config.txt: junos/main.j2
config-base.txt: junos/base.j2
config-irr.txt: junos/irr.j2
$ ./run-jerikan lookup to1-p1.ussfo03.blade-group.net build templates
data.yaml: data.j2
config.txt: cumulus/main.j2
frr.conf: cumulus/frr.j2
interfaces.conf: cumulus/interfaces.j2
ports.conf: cumulus/ports.j2
dhcpd.conf: cumulus/dhcp.j2
default-isc-dhcp: cumulus/default-isc-dhcp.j2
authorized_keys: cumulus/authorized-keys.j2
motd: linux/motd.j2
acl.rules: cumulus/acl.j2
rsyslog.conf: cumulus/rsyslog.conf.j2
Templates use Jinja2. This is the same engine used in Ansible. Jerikan ships some custom filters but also reuses some of the useful filters from Ansible, notably ipaddr. Here is an excerpt of templates/junos/base.j2 to configure DNS and NTP servers on Juniper devices:
system {
  ntp {
{% for ntp in lookup("system", "ntp") %}
    server {{ ntp }};
{% endfor %}
  }
  name-server {
{% for dns in lookup("system", "dns") %}
    {{ dns }};
{% endfor %}
  }
}
The equivalent template for Cisco IOS-XR is:
{% for dns in lookup('system', 'dns') %}
domain vrf VRF-MANAGEMENT name-server {{ dns }}
{% endfor %}
!
{% for syslog in lookup('system', 'syslog') %}
logging {{ syslog }} vrf VRF-MANAGEMENT
{% endfor %}
!
There are three helper functions provided:
  • devices() returns the list of devices matching a set of conditions on the scope. For example, devices("location==ussfo03", "groups==tor-bgp") returns the list of devices in San Francisco in the tor-bgp group. You can also omit the operator if you want the specified value to be equal to the one in the local scope. For example, devices("location") returns devices in the current location.
  • lookup() does a key lookup. It takes the namespace, the key, and optionally, a device name. If not provided, the current device is assumed.
  • scope() returns the scope of the provided device.
Here is how you would define iBGP sessions between edge devices in the same location:
{% for neighbor in devices("location", "groups==edge") if neighbor != device %}
  {% for address in lookup("topology", "addresses", neighbor).loopback|tolist %}
protocols bgp group IPV{{ address|ipv }}-EDGES-IBGP {
  neighbor {{ address }} {
    description "IPv{{ address|ipv }}: iBGP to {{ neighbor }}";
  }
}
  {% endfor %}
{% endfor %}
We also have a global key-value store to save information to be reused in another template or device. This is quite useful to automatically build DNS records. First, capture the IP address inserted into a template with store() as a filter:
interface Loopback0
 description 'Loopback:'
{% for address in lookup('topology', 'addresses').loopback|tolist %}
 ipv{{ address|ipv }} address {{ address|store('addresses', 'Loopback0')|ipaddr('cidr') }}
{% endfor %}
!
Then, reuse it later to build DNS records by iterating over store():4
{% for device, ip, interface in store('addresses') %}
  {% set interface = interface|replace('/', '-')|replace('.', '-')|replace(':', '-') %}
  {% set name = '{}.{}'.format(interface|lower, device) %}
{{ name }}. IN {{ 'A' if ip|ipv4 else 'AAAA' }} {{ ip|ipaddr('address') }}
{% endfor %}
Templates are compiled locally with ./run-jerikan build. The --limit argument restricts the devices to generate configuration files for. Build is not done in parallel because a template may depend on the data collected by another template. Currently, it takes 1 minute to compile around 3000 files spanning over 800 devices.
Jerikan outputs when building templates
Output of Jerikan after building configuration files for six devices
When an error occurs, a detailed traceback is displayed, including the template name, the line number and the value of all visible variables. This is a major time-saver compared to Ansible!
templates/opengear/config.j2:15: in top-level template code
    config.interfaces.{{ interface }}.netmask {{ adddress|ipaddr("netmask") }}
        continent  = 'us'
        device     = 'con1-ag2.ussfo03.blade-group.net'
        environment = 'prod'
        host       = 'con1-ag2.ussfo03'
        infos      = {'address': '172.30.24.19/21'}
        interface  = 'wan'
        location   = 'ussfo03'
        loop       = <LoopContext 1/2>
        member     = '2'
        model      = 'cm7132-2-dac'
        os         = 'opengear'
        shorthost  = 'con1-ag2'
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
value = JerkianUndefined, query = 'netmask', version = False, alias = 'ipaddr'
[…]
        # Check if value is a list and parse each element
        if isinstance(value, (list, tuple, types.GeneratorType)):
            _ret = [ipaddr(element, str(query), version) for element in value]
            return [item for item in _ret if item]
>       elif not value or value is True:
E       jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'adddress'
We don't have general-purpose rules when writing templates. As with the source of truth, there is no need to create generic templates able to produce any BGP configuration. There is a balance to be found between readability and avoiding duplication. Templates can become scary and complex: sometimes, it's better to write a filter or a function in jerikan/jinja.py. Mastering Jinja2 is a good investment. Take time to browse through our templates, as some of them show interesting features.

Checks Optionally, each configuration file can be validated by a script in the checks/ directory. Jerikan looks up the key checks in the build namespace to know which checks to run:
$ ./run-jerikan lookup edge1.ussfo03.blade-group.net build checks
- description: Juniper configuration file syntax check
  script: checks/junoser
  cache:
    input: config.txt
    output: config-set.txt
- description: check YAML data
  script: checks/data.yaml
  cache: data.yaml
In the above example, checks/junoser is executed if there is a change to the generated config.txt file. It also outputs a transformed version of the configuration file, which is easier to understand when using diff. Junoser checks a Junos configuration file using Juniper's XML schema definition for Netconf.5 On error, Jerikan displays:
jerikan/build.py:127: RuntimeError
-------------- Captured syntax check with Junoser call --------------
P: checks/junoser edge2.ussfo03.blade-group.net
C: /app/jerikan
O:
E: Invalid syntax:  set system syslog archive size 10m files 10 word-readable
S: 1

Integration into GitLab CI The next step is to compile the templates using a CI. As we are using GitLab, Jerikan ships with a .gitlab-ci.yml file. When we need to make a change, we create a dedicated branch and a merge request. GitLab compiles the templates using the same environment we use on our laptops and stores them as an artifact.
Merge requests in GitLab for a change
Merge request to add a new port in USSFO03. The templates were compiled successfully but approval from another team member is still required to merge.
Before approving the merge request, another team member looks at the changes in data and templates but also the differences for the generated configuration files:
Differences for generated configuration files
The change configures a port on the Juniper device, adds records to DNS, and updates NetBox with the new IP addresses (not shown).

Ansible After Jerikan has built the configuration files, Ansible takes over. It is also packaged as a Docker image to avoid the trouble of maintaining the right Python virtual environment and to ensure everyone is using the same versions.

Inventory
Jerikan has generated an inventory file. It contains all the managed devices, the variables defined for each of them, and the groups converted to Ansible groups:
ob1-n1.sk1.blade-group.net ansible_host=172.29.15.12 ansible_user=blade ansible_connection=network_cli ansible_network_os=ios
ob2-n1.sk1.blade-group.net ansible_host=172.29.15.13 ansible_user=blade ansible_connection=network_cli ansible_network_os=ios
ob1-n1.ussfo03.blade-group.net ansible_host=172.29.15.12 ansible_user=blade ansible_connection=network_cli ansible_network_os=ios
none ansible_connection=local
[oob]
ob1-n1.sk1.blade-group.net
ob2-n1.sk1.blade-group.net
ob1-n1.ussfo03.blade-group.net
[os-ios]
ob1-n1.sk1.blade-group.net
ob2-n1.sk1.blade-group.net
ob1-n1.ussfo03.blade-group.net
[model-c2960s]
ob1-n1.sk1.blade-group.net
ob2-n1.sk1.blade-group.net
ob1-n1.ussfo03.blade-group.net
[location-sk1]
ob1-n1.sk1.blade-group.net
ob2-n1.sk1.blade-group.net
[location-ussfo03]
ob1-n1.ussfo03.blade-group.net
[in-sync]
ob1-n1.sk1.blade-group.net
ob2-n1.sk1.blade-group.net
ob1-n1.ussfo03.blade-group.net
none
in-sync is a special group for devices whose configuration should match the golden configuration. Daily and unattended, Ansible should be able to push configurations to this group. The mid-term goal is to cover all devices. none is a special device for tasks not related to a specific host. This includes synchronizing NetBox, IRR objects, and the DNS, updating the RPKI, and building the geofeed files.
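As an illustration of the kind of task handled by the none device, here is a minimal sketch of pushing an IP address into NetBox with pynetbox; the NetBox URL, the token, and the use of pynetbox itself are assumptions for the example, not necessarily what our Ansible tasks do:
import pynetbox

# Hypothetical NetBox instance and token.
nb = pynetbox.api("https://netbox.example.net", token="s3cret")

address = "172.30.24.19/21"  # taken from the lookup output shown earlier
existing = nb.ipam.ip_addresses.get(address=address)
if existing is None:
    nb.ipam.ip_addresses.create(
        address=address,
        description="wan on con1-ag2.ussfo03",
    )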

Playbook
We use a single playbook for all devices. It is described in the ansible/playbooks/site.yaml file. Here is a shortened version:
- hosts: adm-gateway:!done
  strategy: mitogen_linear
  roles:
    - blade.linux
    - blade.adm-gateway
    - done
- hosts: os-linux:!done
  strategy: mitogen_linear
  roles:
    - blade.linux
    - done
- hosts: os-junos:!done
  gather_facts: false
  roles:
    - blade.junos
    - done
- hosts: os-opengear:!done
  gather_facts: false
  roles:
    - blade.opengear
    - done
- hosts: none:!done
  gather_facts: false
  roles:
    - blade.none
    - done
A host executes only one of the plays. For example, a Junos device executes the blade.junos role. Once a play has been executed, the device is added to the done group and the other plays are skipped. The playbook can be executed with the configuration files generated by the GitLab CI using the ./run-ansible-gitlab command. This is a wrapper around Docker and the ansible-playbook command and it accepts the same arguments. To deploy the configuration on the edge devices for the SK1 datacenter in check mode, we use:
$ ./run-ansible-gitlab playbooks/site.yaml --limit='edge:&location-sk1' --diff --check
[ ]
PLAY RECAP *************************************************************
edge1.sk1.blade-group.net  : ok=6    changed=0    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0
edge2.sk1.blade-group.net  : ok=5    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0
We have some rules when writing roles:
  • --check must detect if a change is needed;
  • --diff must provide a visualization of the planned changes;
  • --check and --diff must not display anything if there is nothing to change;
  • writing a custom module tailored to our needs is a valid solution;
  • the whole device configuration is managed;6
  • secrets must be stored in Vault;
  • templates should be avoided as we have Jerikan for that; and
  • avoid duplication and reuse tasks.7
We avoid using collections from Ansible Galaxy, the exception being collections to connect and interact with vendor devices, like the cisco.iosxr collection. The quality of Ansible Galaxy collections is quite uneven and they are an additional maintenance burden. It seems better to write roles tailored to our needs. The collections we use are in ci/ansible/ansible-galaxy.yaml. We use Mitogen to get a 10× speedup on Ansible executions on Linux hosts. We also have a few playbooks for operational purposes: upgrading the OS version, isolating an edge router, etc. We were also planning to add operational checks in roles: are all the BGP sessions up? They could have been used to validate a deployment and roll back if there is an issue. Currently, our playbooks are run from our laptops. To keep tabs, we are using ARA. A weekly dry-run on devices in the in-sync group also provides a dashboard showing which devices we need to run Ansible on.

Configuration data and templates
Jerikan ships with pre-populated data and templates matching the configuration of our USSFO03 and SK1 datacenters. They do not exist anymore but, we promise, all this was used in production back in the day!
Network architecture for Blade datacenter
The latest iteration of our network infrastructure for SK1, USSFO03, and future data centers. The production network uses BGPttH over a spine-leaf fabric. The out-of-band network uses a simple L2 design with the spanning-tree protocol, as well as a set of console servers.
Notably, you can find the configuration for:
our edge routers
Some are running on Junos, like edge2.ussfo03, the others on IOS-XR, like edge1.sk1. The implemented functionalities are similar in both cases and we could swap one for the other. It includes the BGP configuration for transits, peerings, and IX as well as the associated BGP policies. PeeringDB is queried to get the maximum number of prefixes to accept for peerings (see the sketch after this list). bgpq3 and a containerized IRRd help to filter received routes. A firewall is added to protect the routing engine. Both IPv4 and IPv6 are configured.
our BGP-based fabric
BGP is used inside the datacenter8 and is extended on bare-metal hosts. The configuration is automatically derived from the device location and the port number.9 Top-of-the-rack devices are using passive BGP sessions for ports towards servers. They also serve a provisioning network to let servers boot using DHCP and PXE, acting as the DHCP server themselves. The design is multivendor. Some devices are running Cumulus Linux, like to1-p1.ussfo03, while some others are running Junos, like to1-p2.ussfo03.
our out-of-band fabric
We are using Cisco Catalyst 2960 switches to build an L2 out-of-band network. To provide redundancy and save a few bucks on wiring, we build small loops and run the spanning-tree protocol. See ob1-p1.ussfo03. It is redundantly connected to our gateway servers. We also use OpenGear devices for console access. See con1-n1.ussfo03.
our administrative gateways
These Linux servers have multiple purposes: SSH jump boxes, rescue connection, direct access to the out-of-band network, zero-touch provisioning of network devices,10 Internet access for management flows, centralization of the console servers using Conserver, and API for autoconfiguration of BGP sessions for bare-metal servers. They are the first servers installed in a new datacenter and are used to provision everything else. Check both the generated files and the associated Ansible tasks.
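Here is the sketch mentioned in the edge-router item above: a minimal example of retrieving the maximum prefix counts for an ASN from the public PeeringDB API. The endpoint and field names come from PeeringDB's documented net object; how the result is fed into the BGP policies is not shown and the surrounding code is illustrative only.
import requests

def max_prefixes(asn):
    # Query the public PeeringDB API for the network owning this ASN.
    r = requests.get(
        "https://www.peeringdb.com/api/net",
        params={"asn": asn},
        timeout=10,
    )
    r.raise_for_status()
    data = r.json()["data"][0]
    # info_prefixes4/info_prefixes6 are the advertised maximum prefix counts.
    return data["info_prefixes4"], data["info_prefixes6"]

print(max_prefixes(2906))  # (IPv4 max prefixes, IPv6 max prefixes)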

  1. Ansible does not even provide a line number when there is an error in a template. You may need to find the problem by bisecting.
    $ ansible --version
    ansible 2.10.8
    [ ]
    $ cat test.j2
    Hello {{ name }}!
    $ ansible all -i localhost, \
    >  --connection=local \
    >  -m template \
    >  -a "src=test.j2 dest=test.txt"
    localhost | FAILED! => {
        "changed": false,
        "msg": "AnsibleUndefinedVariable: 'name' is undefined"
    }
    
  2. You may recognize the same concepts as in Hiera, the hierarchical key-value store from Puppet. At first, we were using Jerakia, a similar independent store exposing an HTTP REST interface. However, the lookup overhead is too large for our use. Jerikan implements the same functionality within a Python function.
  3. The list of available filters is mangled inside jerikan/jinja.py. This is a remnant of the fact that we do not maintain Jerikan as standalone software.
  4. This is a bit confusing: we have a store() filter and a store() function. With Jinja2, filters and functions live in two different namespaces.
  5. We are using a fork with some modifications to be able to validate our configurations and to expose an HTTP service to reduce the time spent on each configuration check.
  6. There is a trend in network automation to automate a configuration subset, for example by having a playbook to create a new BGP session. We believe this is wrong: with time, your configuration will get out of sync with its expected state; notably, hand-made changes will be left undetected.
  7. See ansible/roles/blade.linux/tasks/firewall.yaml and ansible/roles/blade.linux/tasks/interfaces.yaml. They are meant to be called when needed, using import_role.
  8. We also have some datacenters using BGP EVPN VXLAN at medium-scale using Juniper devices. As they are still in production today, we didn't include this feature but we may publish it in the future.
  9. In retrospect, this may not be a good idea unless you are pretty sure everything is uniform (number of switches for each layer, number of ports). This was not our case. We now think it is a better idea to assign a prefix to each device and write it in the source of truth.
  10. Non-Linux-based devices are upgraded and configured unattended. Cumulus Linux devices are automatically upgraded on install but the final configuration has to be pushed using Ansible: we didn't want to duplicate the configuration process using another tool.

24 May 2021

Vincent Bernat: Transient prompt with Zsh

Powerlevel10k is a theme for Zsh. It contains some powerful features, is astoundingly fast, and easy to customize. I am quite amazed at the skills of its main author. Be sure to also have a look at Zsh for Humans, a complete Zsh configuration including this theme. One of the nice features of Powerlevel10k is transient prompts: past prompts are reduced to a more minimal configuration to save space by removing unneeded information.
Demonstration of a transient prompt with Zsh: past prompts use a more compact form
My implementation of a transient prompt with Zsh. Past prompts are compact and include the time of the command execution, the hostname, and the status of the previous command while the complete prompt contains more information like the current directory and the Git branch.
When it comes to configuring my shell, I still prefer writing and understanding each line going into it. Therefore, I am still building my Zsh configuration from scratch. Here is how I have integrated the above transient feature into my prompt. The first step is to configure the appearance of the prompt in its compact form. Let's assume we have a variable, $_vbe_prompt_compact, set to 1 when we want a compact prompt. We use the following function to define the prompt appearance:
_vbe_prompt () {
    local retval=$?
    # When compact, just time + prompt sign
    if (( $_vbe_prompt_compact )); then
        # Current time (with timezone for remote hosts)
        _vbe_prompt_segment cyan default "%D{%H:%M${SSH_TTY+ %Z}}"
        # Hostname for remote hosts
        [[ $SSH_TTY ]] && \
            _vbe_prompt_segment black magenta "%B%M%b"
        # Status of the last command
        if (( $retval )); then
            _vbe_prompt_segment red default ${PRCH[reta]}
        else
            _vbe_prompt_segment green cyan ${PRCH[ok]}
        fi
        # End of prompt
        _vbe_prompt_end
        return
    fi
    # Regular prompt with many information
    # […]
}
setopt prompt_subst
PS1='$(_vbe_prompt) '

Update (2021.05) The following part has been rewritten to be more robust. The code is stolen from Powerlevel10k's issue #888. See the comments for more details.

Our next step is to redraw the prompt after accepting a command. We wrap the Zsh line editor into a function:1
_vbe-zle-line-init() {
    [[ $CONTEXT == start ]] || return 0
    # Start regular line editor
    (( $+zle_bracketed_paste )) && print -r -n - $zle_bracketed_paste[1]
    zle .recursive-edit
    local -i ret=$?
    (( $+zle_bracketed_paste )) && print -r -n - $zle_bracketed_paste[2]
    # If we received EOT, we exit the shell
    if [[ $ret == 0 && $KEYS == $'\4' ]]; then
        _vbe_prompt_compact=1
        zle .reset-prompt
        exit
    fi
    # Line edition is over. Shorten the current prompt.
    _vbe_prompt_compact=1
    zle .reset-prompt
    unset _vbe_prompt_compact
    if (( ret )); then
        # Ctrl-C
        zle .send-break
    else
        # Enter
        zle .accept-line
    fi
    return ret
}
zle -N zle-line-init _vbe-zle-line-init
That's all!
One downside of using the powerline fonts is that they mess with copy/paste. As I am using tmux, I use the following snippet to work around this issue and use only standard Unicode characters when copying from the terminal:
bind-key -T copy-mode M-w \
  send -X copy-pipe-and-cancel "sed 's/ .* /%/g' | xclip -i -selection clipboard" \;\
  display-message "Selection saved to clipboard!"
Copying and pasting the text from the screenshot above yields the following text:
14:21 % ssh eizo.luffy.cx
Linux eizo 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64
Last login: Fri Apr 23 14:20:39 2021 from 2a01:cb00:3f:b02:9db6:efa4:d85:7f9f
14:21 CEST % uname -a
Linux eizo 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux
14:21 CEST %
Connection to eizo.luffy.cx closed.
14:22 % git status
On branch article/zsh-transient
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        ../../media/images/zsh-compact-prompt@2x.jpg
nothing added to commit but untracked files present (use "git add" to track)

  1. We have to manually enable bracketed paste because Zsh does it after zle-line-init.

1 April 2021

Utkarsh Gupta: FOSS Activites in March 2021

Here's my (eighteenth) monthly update about the activities I've done in the F/L/OSS world.

Debian
This was my 27th month of actively contributing to Debian. I became a DM in late March 2019 and a DD on Christmas '19! \o/ This month was a bit exhausting; lots of moving parts. With the financial year ending, it was even crazier, with me running around to banks, CA, et al.
Anyway, now that I am working on Ubuntu full-time, I did little Debian work this month. Here are the things I worked on:

Uploads and bug fixes:

Other $things:
  • Attended the Debian LTS team meeting.
  • Mentoring for newcomers.
  • Moderation of -project mailing list.

Debian (E)LTS
Debian Long Term Support (LTS) is a project to extend the lifetime of all Debian stable releases to (at least) 5 years. Debian LTS is not handled by the Debian security team, but by a separate group of volunteers and companies interested in making it a success. And Debian Extended LTS (ELTS) is its sister project, extending support to the Jessie release (+2 years after LTS support). This was my eighteenth month as a Debian LTS and ninth month as a Debian ELTS paid contributor.
I was assigned 60.00 hours for LTS and 39.00 hours for ELTS and worked on the following things:

LTS CVE Fixes and Announcements:

ELTS CVE Fixes and Announcements:

Other (E)LTS Work:
  • Front-desk duty from 01-03 until 07-03 for ELTS and then from 29-03 until 04-04 for both LTS and ELTS.
  • Triaged wpa, python-aiohttp, spip, wpa, qemu, tomcat7, tomcat8, grub2, mupdf, openssh, tiff, spice, pillow, xmlgraphics-commons, batik, libupnp, ca-certificates, salt, squid3, shibboleth-sp2, courier-authlib, cloud-init, spamassassin, openssl, libcaca, and openjpeg2.
  • Marked CVE-2021-21330/python-aiohttp as not-affected for stretch.
  • Marked CVE-2021-20233, CVE-2021-20225, CVE-2020-27779, CVE-2020-27778, CVE-2020-27749, CVE-2020-27748, CVE-2020-25647, CVE-2020-25632, CVE-2020-25631, and CVE-2020-14372, affecting grub2, as ignored for stretch and jessie.
  • Marked CVE-2020-27842/openjpeg2 as no-dsa for jessie.
  • Marked CVE-2020-27843/openjpeg2 as no-dsa for jessie.
  • Marked CVE-2021-28041/openssh as not-affected for jessie.
  • Marked CVE-2020-3552{3,4}/tiff as no-dsa for jessie.
  • Marked CVE-2021-20201/spice as no-dsa for jessie.
  • Marked CVE-2020-11988/xmlgraphics-commons as postponed for jessie.
  • Marked CVE-2020-11987/batik as postponed for jessie.
  • Marked CVE-2020-12695/libupnp as no-dsa for stretch.
  • Marked CVE-2021-25122/tomcat7 as not-affected for stretch.
  • Marked CVE-2021-25329/tomcat7 as ignored for stretch.
  • Marked CVE-2021-28116/squid3 as postponed for stretch and jessie.
  • Marked CVE-2021-3449/openssl as not-affected for stretch.
  • Document extra notes for grub2 for LTS and co-ordinate with the sec-team.
  • Document extra notes for pillow about piled-up issues in jessie.
  • Issued DLA-2593-1 for ca-certificates on Microsoft's request; co-ordinating w/ them.
  • Co-ordinating w/ maintainer of courier-authlib for stretch and jessie update.
  • Fixing build failures of ELTS security tracker and re-ordering entries in data/CVE-EXTENDED-LTS/list file.
  • Answer queries of dupondje and mikap about openssl on IRC; and it being not-affected for stretch.
  • Help review the status of CVE-2021-3121/golang-github-gogo-protobuf-dev for Ola.
  • Co-ordinating w/ Noah for cloud-init and setuptools.
  • Auto-EOL'ed mongodb, linux, guacamole-client, node-xmlhttprequest, newlib, neutron, privoxy, glpi, and zabbix for jessie.
  • Attended monthly meeting for Debian LTS.
  • Answered questions (& discussions) on IRC (#debian-lts and #debian-elts).
  • General and other discussions on LTS private and public mailing list.

Until next time.
:wq for today.

21 February 2021

Matthew Garrett: Making hibernation work under Linux Lockdown

Linux draws a distinction between code running in kernel (kernel space) and applications running in userland (user space). This is enforced at the hardware level - in x86-speak[1], kernel space code runs in ring 0 and user space code runs in ring 3[2]. If you're running in ring 3 and you attempt to touch memory that's only accessible in ring 0, the hardware will raise a fault. No matter how privileged your ring 3 code, you don't get to touch ring 0.

Kind of. In theory. Traditionally this wasn't well enforced. At the most basic level, since root can load kernel modules, you could just build a kernel module that performed any kernel modifications you wanted and then have root load it. Technically user space code wasn't modifying kernel space code, but the difference was pretty semantic rather than useful. But it got worse - root could also map memory ranges belonging to PCI devices[3], and if the device could perform DMA you could just ask the device to overwrite bits of the kernel[4]. Or root could modify special CPU registers ("Model Specific Registers", or MSRs) that alter CPU behaviour via the /dev/msr interface, and compromise the kernel boundary that way.

It turns out that there were a number of ways root was effectively equivalent to ring 0, and the boundary was more about reliability (ie, a process running as root that ends up misbehaving should still only be able to crash itself rather than taking down the kernel with it) than security. After all, if you were root you could just replace the on-disk kernel with a backdoored one and reboot. Going deeper, you could replace the bootloader with one that automatically injected backdoors into a legitimate kernel image. We didn't have any way to prevent this sort of thing, so attempting to harden the root/kernel boundary wasn't especially interesting.

In 2012 Microsoft started requiring vendors ship systems with UEFI Secure Boot, a firmware feature that allowed[5] systems to refuse to boot anything without an appropriate signature. This not only enabled the creation of a system that drew a strong boundary between root and kernel, it arguably required one - what's the point of restricting what the firmware will stick in ring 0 if root can just throw more code in there afterwards? What ended up as the Lockdown Linux Security Module provides the tooling for this, blocking userspace interfaces that can be used to modify the kernel and enforcing that any modules have a trusted signature.

But that comes at something of a cost. Most of the features that Lockdown blocks are fairly niche, so the direct impact of having it enabled is small. Except that it also blocks hibernation[6], and it turns out some people were using that. The obvious question is "what does hibernation have to do with keeping root out of kernel space", and the answer is a little convoluted and is tied into how Linux implements hibernation. Basically, Linux saves system state into the swap partition and modifies the header to indicate that there's a hibernation image there instead of swap. On the next boot, the kernel sees the header indicating that it's a hibernation image, copies the contents of the swap partition back into RAM, and then jumps back into the old kernel code. What ensures that the hibernation image was actually written out by the kernel? Absolutely nothing, which means a motivated attacker with root access could turn off swap, write a hibernation image to the swap partition themselves, and then reboot. The kernel would happily resume into the attacker's image, giving the attacker control over what gets copied back into kernel space.

This is annoying, because normally when we think about attacks on swap we mitigate it by requiring an encrypted swap partition. But in this case, our attacker is root, and so already has access to the plaintext version of the swap partition. Disk encryption doesn't save us here. We need some way to verify that the hibernation image was written out by the kernel, not by root. And thankfully we have some tools for that.

Trusted Platform Modules (TPMs) are cryptographic coprocessors[7] capable of doing things like generating encryption keys and then encrypting things with them. You can ask a TPM to encrypt something with a key that's tied to that specific TPM - the OS has no access to the decryption key, and nor does any other TPM. So we can have the kernel generate an encryption key, encrypt part of the hibernation image with it, and then have the TPM encrypt it. We store the encrypted copy of the key in the hibernation image as well. On resume, the kernel reads the encrypted copy of the key, passes it to the TPM, gets the decrypted copy back and is able to verify the hibernation image.

That's great! Except root can do exactly the same thing. This tells us the hibernation image was generated on this machine, but doesn't tell us that it was done by the kernel. We need some way to be able to differentiate between keys that were generated in kernel and ones that were generated in userland. TPMs have the concept of "localities" (effectively privilege levels) that would be perfect for this. Userland is only able to access locality 0, so the kernel could simply use locality 1 to encrypt the key. Unfortunately, despite trying pretty hard, I've been unable to get localities to work. The motherboard chipset on my test machines simply doesn't forward any accesses to the TPM unless they're for locality 0. I needed another approach.

TPMs have a set of Platform Configuration Registers (PCRs), intended for keeping a record of system state. The OS isn't able to modify the PCRs directly. Instead, the OS provides a cryptographic hash of some material to the TPM. The TPM takes the existing PCR value, appends the new hash to that, and then stores the hash of the combination in the PCR - a process called "extension". This means that the new value of the PCR depends not only on the value of the new data, it depends on the previous value of the PCR - and, in turn, that previous value depended on its previous value, and so on. The only way to get to a specific PCR value is to either (a) break the hash algorithm, or (b) perform exactly the same sequence of writes. On system reset the PCRs go back to a known value, and the entire process starts again.
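To make the extension step concrete, here is a small Python model of how a PCR value evolves; it only mirrors the hash chaining described above, assuming SHA-256, and leaves out everything else a real TPM does.

import hashlib

def extend(pcr, data):
    # New PCR value = H(old PCR value || H(data)); the order of writes matters.
    return hashlib.sha256(pcr + hashlib.sha256(data).digest()).digest()

pcr23 = bytes(32)  # PCRs start from a known value after reset
pcr23 = extend(pcr23, b"known kernel-only value")
print(pcr23.hex())

# Reaching the same value again requires replaying the exact same sequence of
# extends from the reset value; there is no way to set a PCR to an arbitrary value.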

Some PCRs are different. PCR 23, for example, can be reset back to its original value without resetting the system. We can make use of that. The first thing we need to do is to prevent userland from being able to reset or extend PCR 23 itself. All TPM accesses go through the kernel, so this is a simple matter of parsing the write before it's sent to the TPM and returning an error if it's a sensitive command that would touch PCR 23. We now know that any change in PCR 23's state will be restricted to the kernel.

When we encrypt material with the TPM, we can ask it to record the PCR state. This is given back to us as metadata accompanying the encrypted secret. Along with the metadata is an additional signature created by the TPM, which can be used to prove that the metadata is both legitimate and associated with this specific encrypted data. In our case, that means we know what the value of PCR 23 was when we encrypted the key. That means that if we simply extend PCR 23 with a known value in-kernel before encrypting our key, we can look at the value of PCR 23 in the metadata. If it matches, the key was encrypted by the kernel - userland can create its own key, but it has no way to extend PCR 23 to the appropriate value first. We now know that the key was generated by the kernel.

But what if the attacker is able to gain access to the encrypted key? Let's say a kernel bug is hit that prevents hibernation from resuming, and you boot back up without wiping the hibernation image. Root can then read the key from the partition, ask the TPM to decrypt it, and then use that to create a new hibernation image. We probably want to prevent that as well. Fortunately, when you ask the TPM to encrypt something, you can ask that the TPM only decrypt it if the PCRs have specific values. "Sealing" material to the TPM in this way allows you to block decryption if the system isn't in the desired state. So, we define a policy that says that PCR 23 must have the same value at resume as it did on hibernation. On resume, the kernel resets PCR 23, extends it to the same value it did during hibernation, and then attempts to decrypt the key. Afterwards, it resets PCR 23 back to the initial value. Even if an attacker gains access to the encrypted copy of the key, the TPM will refuse to decrypt it.

And that's what this patchset implements. There's one fairly significant flaw at the moment, which is simply that an attacker can just reboot into an older kernel that doesn't implement the PCR 23 blocking and set up state by hand. Fortunately, this can be avoided using another aspect of the boot process. When you boot something via UEFI Secure Boot, the signing key used to verify the booted code is measured into PCR 7 by the system firmware. In the Linux world, the Shim bootloader then measures any additional keys that are used. By either using a new key to tag kernels that have support for the PCR 23 restrictions, or by embedding some additional metadata in the kernel that indicates the presence of this feature and measuring that, we can have a PCR 7 value that verifies that the PCR 23 restrictions are present. We then seal the key to PCR 7 as well as PCR 23, and if an attacker boots into a kernel that doesn't have this feature the PCR 7 value will be different and the TPM will refuse to decrypt the secret.

While there's a whole bunch of complexity here, the process should be entirely transparent to the user. The current implementation requires a TPM 2, and I'm not certain whether TPM 1.2 provides all the features necessary to do this properly - if so, extending it shouldn't be hard, but also all systems shipped in the past few years should have a TPM 2, so that's going to depend on whether there's sufficient interest to justify the work. And we're also at the early days of review, so there's always the risk that I've missed something obvious and there are terrible holes in this. And, well, given that it took almost 8 years to get the Lockdown patchset into mainline, let's not assume that I'm good at landing security code.

[1] Other architectures use different terminology here, such as "supervisor" and "user" mode, but it's broadly equivalent
[2] In theory rings 1 and 2 would allow you to run drivers with privileges somewhere between full kernel access and userland applications, but in reality we just don't talk about them in polite company
[3] This is how graphics worked in Linux before kernel modesetting turned up. XFree86 would just map your GPU's registers into userland and poke them directly. This was not a huge win for stability
[4] IOMMUs can help you here, by restricting the memory PCI devices can DMA to or from. The kernel then gets to allocate ranges for device buffers and configure the IOMMU such that the device can't DMA to anything else. Except that region of memory may still contain sensitive material such as function pointers, and attacks like this can still cause you problems as a result.
[5] This describes why I'm using "allowed" rather than "required" here
[6] Saving the system state to disk and powering down the platform entirely - significantly slower than suspending the system while keeping state in RAM, but also resilient against the system losing power.
[7] With some handwaving around "coprocessor". TPMs can't be part of the OS or the system firmware, but they don't technically need to be an independent component. Intel have a TPM implementation that runs on the Management Engine, a separate processor built into the motherboard chipset. AMD have one that runs on the Platform Security Processor, a small ARM core built into their CPU. Various ARM implementations run a TPM in Trustzone, a special CPU mode that (in theory) is able to access resources that are entirely blocked off from anything running in the OS, kernel or otherwise.


16 February 2021

Michael Prokop: How to properly use 3rd party Debian repository signing keys with apt

(Blogging this, since this is a recurring anti-pattern I noticed at several customers and that often comes up during deployments of 3rd party repositories.) Update on 2021-02-19: clarified that Signed-By requires apt >= 1.1, thanks to Vincent Bernat. Many upstream projects provide Debian repository instructions like this:
curl -fsSL https://example.com/stable/debian.gpg | sudo apt-key add -
Do not follow this, for different reasons, including:
  1. You do not see what you get before adding the GPG key to your global apt trust store
  2. You can't easily script this via your preferred configuration management (the apt-key manpage clearly discourages programmatic usage)
  3. The signing key is considered valid for all your enabled Debian repositories (instead of only a specific one)
  4. You need GnuPG (either gnupg2 or gnupg1) on your system for usage with apt-key
There's a much better approach to this: download the GPG key, make sure it's in the appropriate format, then use it via deb [signed-by=/usr/share/keyrings/…] in your apt's sources list configuration. Note and FTR: the Signed-By feature is available starting with apt 1.1 (so apt in Debian jessie/8 and older does not support it). TL;DR: As an example, let's demonstrate this with the Tailscale Debian repository for buster.
Downloading the GPG file will give you an ascii-armored GPG file:
% curl -fsSL -o buster.gpg https://pkgs.tailscale.com/stable/debian/buster.gpg
% gpg --keyid-format long buster.gpg 
gpg: WARNING: no command supplied.  Trying to guess what you mean ...
pub   rsa4096/458CA832957F5868 2020-02-25 [SC]
      2596A99EAAB33821893C0A79458CA832957F5868
uid                           Tailscale Inc. (Package repository signing key) <info@tailscale.com>
sub   rsa4096/B1547A3DDAAF03C6 2020-02-25 [E]
% file buster.gpg
buster.gpg: PGP public key block Public-Key (old)
If you have apt version >= 1.4 available (Debian >=stretch/9 and Ubuntu >=bionic/18.04), you can use this file directly as follows:
% sudo mv buster.gpg /usr/share/keyrings/tailscale.asc
% cat /etc/apt/sources.list.d/tailscale.list
deb [signed-by=/usr/share/keyrings/tailscale.asc] https://pkgs.tailscale.com/stable/debian buster main
% sudo apt update
[...]
And you're done! Iff your apt version really is older than 1.4, you need to convert the ascii-armored GPG file into a GPG key public ring file (AKA binary OpenPGP format), either by just dearmor-ing it (if you don't care about checking ID + fingerprint):
% gpg --dearmor < buster.gpg > tailscale.gpg
or if you prefer to go via GPG, you can also use a temporary GPG home directory (if you don't care about going through your personal GPG setup):
% mkdir --mode=700 /tmp/gpg-tmpdir
% gpg --homedir /tmp/gpg-tmpdir --import ./buster.gpg
gpg: keybox '/tmp/gpg-tmpdir/pubring.kbx' created
gpg: /tmp/gpg-tmpdir/trustdb.gpg: trustdb created
gpg: key 458CA832957F5868: public key "Tailscale Inc. (Package repository signing key) <info@tailscale.com>" imported
gpg: Total number processed: 1
gpg:               imported: 1
% gpg --homedir /tmp/gpg-tmpdir --output tailscale.gpg  --export-options=export-minimal --export 0x458CA832957F5868
% rm -rf /tmp/gpg-tmpdir
The resulting GPG key public ring file should look like that:
% file tailscale.gpg 
tailscale.gpg: PGP/GPG key public ring (v4) created Tue Feb 25 04:51:20 2020 RSA (Encrypt or Sign) 4096 bits MPI=0xc00399b10bc12858...
% gpg tailscale.gpg 
gpg: WARNING: no command supplied.  Trying to guess what you mean ...
pub   rsa4096/458CA832957F5868 2020-02-25 [SC]
      2596A99EAAB33821893C0A79458CA832957F5868
uid                           Tailscale Inc. (Package repository signing key) <info@tailscale.com>
sub   rsa4096/B1547A3DDAAF03C6 2020-02-25 [E]
Then you can use this GPG file on your system as follows:
% sudo mv tailscale.gpg /usr/share/keyrings/tailscale.gpg
% cat /etc/apt/sources.list.d/tailscale.list
deb [signed-by=/usr/share/keyrings/tailscale.gpg] https://pkgs.tailscale.com/stable/debian buster main
% sudo apt update
[...]
Such a setup ensures:
  1. You can verify the GPG key file (ID + fingerprint)
  2. You can easily ship files via /usr/share/keyrings/ and refer to them in your deployment scripts and configuration management (and can also easily update or get rid of them again!)
  3. The GPG key is valid only for the repositories with the corresponding [signed-by=/usr/share/keyrings/ ] entry
  4. You don t need to install GnuPG (neither gnupg2 nor gnupg1) on the system which is using the 3rd party Debian repository
Thanks: Guillem Jover for reviewing an early draft of this blog article.

17 January 2021

Wouter Verhelst: Software available through Extrepo

Just over 7 months ago, I blogged about extrepo, my answer to the "how do you safely install software on Debian without downloading random scripts off the Internet and running them as root" question. I also held a talk during the recent "MiniDebConf Online" that was held, well, online. The most important part of extrepo is "what can you install through it". If the number of available repositories is too low, there's really no reason to use it. So, I thought, let's look what we have after 7 months... To cut to the chase, there's a bunch of interesting content there, although not all of it has a "main" policy. Each of these can be enabled by installing extrepo, and then running extrepo enable <reponame>, where <reponame> is the name of the repository. Note that the list is not exhaustive, but I intend to show that even though we're nowhere near complete, extrepo is already quite useful in its current state:

Free software
  • The debian_official, debian_backports, and debian_experimental repositories contain Debian's official, backports, and experimental repositories, respectively. These shouldn't have to be managed through extrepo, but then again it might be useful for someone, so I decided to just add them anyway. The config here uses the deb.debian.org alias for CDN-backed package mirrors.
  • The belgium_eid repository contains the Belgian eID software. Obviously this is added, since I'm upstream for eID, and as such it was a large motivating factor for me to actually write extrepo in the first place.
  • elastic: the elasticsearch software.
  • Some repositories, such as dovecot, winehq, and bareos, contain upstream versions of their respective software. These repositories contain software that is available in Debian, too; but their upstreams package their most recent release independently, and some people might prefer to run those instead.
  • The sury, fai, and postgresql repositories, as well as a number of repositories such as openstack_rocky, openstack_train, haproxy-1.5 and haproxy-2.0 (there are more) contain more recent versions of software packaged in Debian already by the same maintainer of that package repository. For the sury repository, that is PHP; for the others, the name should give it away. The difference between these repositories and the ones above is that it is the official Debian maintainer for the same software who maintains the repository, which is not the case for the others.
  • The vscodium repository contains the unencumbered version of Microsoft's Visual Studio Code; i.e., the codium version of Visual Studio Code is to code as the chromium browser is to chrome: it is a build of the same software, but without the non-free bits that make code not entirely Free Software.
  • While Debian ships with at least two browsers (Firefox and Chromium), additional browsers are available through extrepo, too. The iridiumbrowser repository contains a Chromium-based browser that focuses on privacy.
  • Speaking of privacy, perhaps you might want to try out the torproject repository.
  • For those who want to do Cloud Computing on Debian in ways that aren't covered by Openstack, there is a kubernetes repository that contains the Kubernetes stack, as well as the google_cloud one containing the Google Cloud SDK.

Non-free software
While these are available to be installed through extrepo, please note that non-free and contrib repositories are disabled by default. To use them, you must first enable the corresponding policies; this can be accomplished through /etc/extrepo/config.yaml.
  • In case you don't care about freedom and want the official build of Visual Studio Code, the vscode repository contains it.
  • While we're on the subject of Microsoft, there's also Microsoft Teams available in the msteams repository. And, hey, skype.
  • For those who are not satisfied with the free browsers in Debian or any of the free repositories, there's opera and google_chrome.
  • The docker-ce repository contains the official build of Docker CE. While this is the free "community edition" that should have free licenses, I could not find a licensing statement anywhere, and therefore I'm not 100% sure whether this repository is actually free software. For that reason, it is currently marked as a non-free one. Merge Requests for rectifying that from someone with more information on the actual licensing situation of Docker CE would be welcome...
  • For gamers, there's Valve's steam repository.
Again, the above lists are not meant to be exhaustive. Special thanks go out to Russ Allbery, Kim Alvefur, Vincent Bernat, Nick Black, Arnaud Ferraris, Thorsten Glaser, Thomas Goirand, Juri Grabowski, Paolo Greppi, and Josh Triplett, for helping me build the current list of repositories. Is your favourite repository not listed? Create a configuration based on template.yaml, and file a merge request!

30 November 2020

Chris Lamb: Free software activities in November 2020

Here is my monthly update covering what I have been doing in the free software world during November 2020 (previous month):

Reproducible Builds
One of the original promises of open source software is that distributed peer review and transparency of process results in enhanced end-user security. However, whilst anyone may inspect the source code of free and open source software for malicious flaws, almost all software today is distributed as pre-compiled binaries. This allows nefarious third-parties to compromise systems by injecting malicious code into ostensibly secure software during the various compilation and distribution processes. The motivation behind the Reproducible Builds effort is to ensure no flaws have been introduced during this compilation process by promising identical results are always generated from a given source, thus allowing multiple third-parties to come to a consensus on whether a build was compromised. The project is proud to be a member project of the Software Freedom Conservancy. Conservancy acts as a corporate umbrella allowing projects to operate as non-profit initiatives without managing their own corporate structure. If you like the work of the Conservancy or the Reproducible Builds project, please consider becoming an official supporter.
This month, I:
I also made the following changes to diffoscope:

Debian
I performed the following uploads to the Debian Linux distribution this month: I also filed a release-critical bug against the minidlna package which could not be successfully purged from the system without reporting a "cannot remove '/var/log/minidlna'" error. (#975372)

Debian LTS
This month I have worked 18 hours on Debian Long Term Support (LTS) and 12 hours on its sister Extended LTS project, including: You can find out more about the Debian LTS project via the following video:

15 November 2020

Vincent Bernat: Zero-Touch Provisioning for Juniper

Juniper's official documentation on ZTP explains how to configure the ISC DHCP Server to automatically upgrade and configure a Juniper device on first boot. However, the proposed configuration could be a bit more elegant. This note explains how.

TL;DR Do not redefine option 43. Instead, specify the vendor option space to use to encode parameters with vendor-option-space.


When booting for the first time, a Juniper device requests its IP address through a DHCP discover message, then requests additional parameters for autoconfiguration through a DHCP request message:
Dynamic Host Configuration Protocol (Request)
    Message type: Boot Request (1)
    Hardware type: Ethernet (0x01)
    Hardware address length: 6
    Hops: 0
    Transaction ID: 0x44e3a7c9
    Seconds elapsed: 0
    Bootp flags: 0x8000, Broadcast flag (Broadcast)
    Client IP address: 0.0.0.0
    Your (client) IP address: 0.0.0.0
    Next server IP address: 0.0.0.0
    Relay agent IP address: 0.0.0.0
    Client MAC address: 02:00:00:00:00:01 (02:00:00:00:00:01)
    Client hardware address padding: 00000000000000000000
    Server host name not given
    Boot file name not given
    Magic cookie: DHCP
    Option: (54) DHCP Server Identifier (10.0.2.2)
    Option: (55) Parameter Request List
        Length: 14
        Parameter Request List Item: (3) Router
        Parameter Request List Item: (51) IP Address Lease Time
        Parameter Request List Item: (1) Subnet Mask
        Parameter Request List Item: (15) Domain Name
        Parameter Request List Item: (6) Domain Name Server
        Parameter Request List Item: (66) TFTP Server Name
        Parameter Request List Item: (67) Bootfile name
        Parameter Request List Item: (120) SIP Servers
        Parameter Request List Item: (44) NetBIOS over TCP/IP Name Server
        Parameter Request List Item: (43) Vendor-Specific Information
        Parameter Request List Item: (150) TFTP Server Address
        Parameter Request List Item: (12) Host Name
        Parameter Request List Item: (7) Log Server
        Parameter Request List Item: (42) Network Time Protocol Servers
    Option: (50) Requested IP Address (10.0.2.15)
    Option: (53) DHCP Message Type (Request)
    Option: (60) Vendor class identifier
        Length: 15
        Vendor class identifier: Juniper-mx10003
    Option: (51) IP Address Lease Time
    Option: (12) Host Name
    Option: (255) End
    Padding: 00
It requests several options, including the TFTP server address option 150, and the Vendor-Specific Information Option 43 or VSIO. The DHCP server can use option 60 to identify the vendor-specific information to send. For Juniper devices, option 43 encodes the image name and the configuration file name. They are fetched from the IP address provided in option 150. The official documentation on ZTP provides a valid configuration to answer such a request. However, it does not leverage the ability of the ISC DHCP Server to support several vendors and redefines option 43 to be Juniper-specific:
option NEW_OP-encapsulation code 43 = encapsulate NEW_OP;
Instead, it is possible to define an option space for Juniper, using a self-descriptive name, without overriding option 43:
# Juniper vendor option space
option space juniper;
option juniper.image-file-name     code 0 = text;
option juniper.config-file-name    code 1 = text;
option juniper.image-file-type     code 2 = text;
option juniper.transfer-mode       code 3 = text;
option juniper.alt-image-file-name code 4 = text;
option juniper.http-port           code 5 = text;
Then, when you need to set these suboptions, specify the vendor option space:
class "juniper-mx10003"  
  match if (option vendor-class-identifier = "Juniper-mx10003")  
  vendor-option-space juniper;
  option juniper.transfer-mode    "http";
  option juniper.image-file-name  "/images/junos-vmhost-install-mx-x86-64-19.3R2-S4.5.tgz";
  option juniper.config-file-name "/cfg/juniper-mx10003.txt";
 
This configuration returns the following answer:1
Dynamic Host Configuration Protocol (ACK)
    Message type: Boot Reply (2)
    Hardware type: Ethernet (0x01)
    Hardware address length: 6
    Hops: 0
    Transaction ID: 0x44e3a7c9
    Seconds elapsed: 0
    Bootp flags: 0x8000, Broadcast flag (Broadcast)
    Client IP address: 0.0.0.0
    Your (client) IP address: 10.0.2.15
    Next server IP address: 0.0.0.0
    Relay agent IP address: 0.0.0.0
    Client MAC address: 02:00:00:00:00:01 (02:00:00:00:00:01)
    Client hardware address padding: 00000000000000000000
    Server host name not given
    Boot file name not given
    Magic cookie: DHCP
    Option: (53) DHCP Message Type (ACK)
    Option: (54) DHCP Server Identifier (10.0.2.2)
    Option: (51) IP Address Lease Time
    Option: (1) Subnet Mask (255.255.255.0)
    Option: (3) Router
    Option: (6) Domain Name Server
    Option: (43) Vendor-Specific Information
        Length: 89
        Value: 00362f696d616765732f6a756e6f732d766d686f73742d69 
    Option: (150) TFTP Server Address
    Option: (255) End
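The value shown for option 43 is simply the Juniper suboptions packed as type-length-value records. The following sketch rebuilds the beginning of that payload from the suboption codes defined earlier; the suboption order chosen by the real server may differ:
def encode_vsio(suboptions):
    # Pack (code, text) pairs as the TLV payload of DHCP option 43.
    payload = b""
    for code, value in suboptions:
        data = value.encode()
        payload += bytes([code, len(data)]) + data
    return payload

payload = encode_vsio([
    (0, "/images/junos-vmhost-install-mx-x86-64-19.3R2-S4.5.tgz"),  # image-file-name
    (1, "/cfg/juniper-mx10003.txt"),                                # config-file-name
    (3, "http"),                                                    # transfer-mode
])
print(payload.hex()[:48])  # 00362f696d616765732f6a756e6f732d766d686f73742d69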
Using the vendor-option-space directive allows you to make different ZTP implementations coexist. For example, you can add the option space for PXE:
option space PXE;
option PXE.mtftp-ip    code 1 = ip-address;
option PXE.mtftp-cport code 2 = unsigned integer 16;
option PXE.mtftp-sport code 3 = unsigned integer 16;
option PXE.mtftp-tmout code 4 = unsigned integer 8;
option PXE.mtftp-delay code 5 = unsigned integer 8;
option PXE.discovery-control    code 6  = unsigned integer 8;
option PXE.discovery-mcast-addr code 7  = ip-address;
option PXE.boot-server code 8  = { unsigned integer 16, unsigned integer 8, ip-address };
option PXE.boot-menu   code 9  = { unsigned integer 16, unsigned integer 8, text };
option PXE.menu-prompt code 10 = { unsigned integer 8, text };
option PXE.boot-item   code 71 = unsigned integer 32;
class "pxeclients" {
  match if substring (option vendor-class-identifier, 0, 9) = "PXEClient";
  vendor-option-space PXE;
  option PXE.mtftp-ip 10.0.2.2;
  # […]
}
On the same topic, do not override option 125 VIVSO. See Zero-Touch Provisioning for Cisco IOS.

  1. Wireshark knows how to decode option 43 for some vendors, thanks to option 60, but not for Juniper.

2 November 2020

Vincent Bernat: My collection of vintage PC cards

Recently, I have been gathering some old hardware at my parents' house, notably PC extension cards, as they don't take much room and can be converted to a nice display item. Unfortunately, I was not very concerned about keeping stuff around. Compared to all the hardware I have acquired over the years, only a few pieces remain.

Tseng Labs ET4000AX (1989)
This SVGA graphics card was installed into a PC powered by a 386SX CPU running at 16 MHz. This was a good card at the time as it was pretty fast. It didn't feature 2D acceleration, unlike the later ET4000/W32. This version only features 512 KB of RAM. It can display 1024×768 images with 16 colors or 800×600 with 256 colors. It was also compatible with CGA, EGA, VGA, MDA, and Hercules modes. No contemporary games were using the SVGA modes but the higher resolutions were useful with Windows 3. This card was manufactured directly by Tseng Labs.
Tseng Labs ET4000AX ISA card on top of the "Planète Aventure" box
Tseng Labs ET4000 AX ISA card

AdLib clone (1992)
My first sound card was an AdLib. My parents bought it in Canada during the summer holidays in 1992. It uses a Yamaha OPL2 chip to produce sound via FM synthesis. The first game I tried it with was Indiana Jones and the Last Crusade. I think I gave this AdLib to a friend once I upgraded my PC with a Sound Blaster Pro 2. Recently, I needed one for a side project, but they are rare and expensive on eBay. Someone mentioned a cheap clone on Vogons, so I bought it. It was sold by Sun Moon Star in 1992 and shipped with a CD-ROM of Doom shareware.
AdLib clone on top of "Alone in the Dark" box
AdLib clone ISA card by Sun Moon Star
On this topic, take a look at OPL2LPT: an AdLib sound card for the parallel port and OPL2 Audio Board: an AdLib sound card for Arduino .

Sound Blaster Pro 2 (1992)
Later, I replaced the AdLib sound card with a Sound Blaster Pro 2. It features an OPL3 chip and was also able to output digital samples. At the time, this was a welcome addition, but not as important as the FM synthesis introduced earlier by the AdLib.
Sound Blaster Pro 2 on top of "Day of the Tentacle" box
Sound Blaster Pro 2 ISA card

Promise EIDE 2300 Plus (1995)
I bought this card mostly for the serial port. I was using a 486DX2 running at 66 MHz with a Creatix LC 288 FC external modem. The serial port was driven by an 8250 UART with no buffer. Thanks to Terminate, I was able to connect to BBSes with DOS, but this was not possible with Windows 3 or OS/2. I needed one of these fancy new cards with a 16550 UART, featuring a 16-byte buffer. At the time, this was quite difficult to find in France. During a holiday trip, I convinced my parents to make a short detour from Los Angeles to San Diego to buy this Promise EIDE 2300 Plus controller card at a shop I located through an advertisement in a local magazine! The card also features an EIDE controller with multi-word DMA mode 2 support. In contrast with the older PIO modes, the CPU didn't have to copy data from disk to memory.
Promise EIDE 2300 Plus next to an OS/2 Warp CD
Promise EIDE 2300 Plus VLB card

3dfx Voodoo2 Magic 3D II (1998)
The 3dfx Voodoo2 was one of the first add-in graphics cards implementing hardware acceleration of 3D graphics. I bought it from a friend along with his Pentium II box in 1999. It was a big evolutionary step in PC gaming, as games became more beautiful and fluid. A traditional video controller was still required for 2D. A pass-through VGA cable daisy-chained the video controller to the Voodoo, which was itself connected to the monitor.
3dfx Voodoo 2 Magic 3D II on top of "Jedi Knight: Dark Forces II" box
3dfx Voodoo2 Magic 3D II PCI card

3Com 3C905C-TX-M Tornado (1999)
In the early 2000s, in college, the Internet connection on the campus was provided by a student association through a 100 Mbps Ethernet cable. If you wanted to reach the maximum speed, the 3Com 3C905C-TX-M PCI network adapter, nicknamed "Tornado", was the card you needed. We would buy them second-hand by the dozen and sell them to other students for around 30 €.
3COM 3C905C-TX-M on top of "Red Alert" box
3Com 3C905C-TX-M PCI card

1 November 2020

Vincent Bernat: Running Isso on NixOS in a Docker container

This short article documents how I run Isso, the commenting system used by this blog, inside a Docker container on NixOS, a Linux distribution built on top of Nix. Nix is a declarative package manager for Linux and other Unix systems.
While NixOS 20.09 includes a derivation for Isso, it is unfortunately broken and relies on Python 2. As I am also using a fork of Isso, I have built my own derivation, heavily inspired by the one in master:1
issoPackage = with pkgs.python3Packages; buildPythonPackage rec {
  pname = "isso";
  version = "custom";
  src = pkgs.fetchFromGitHub {
    # Use my fork
    owner = "vincentbernat";
    repo = pname;
    rev = "vbe/master";
    sha256 = "0vkkvjcvcjcdzdj73qig32hqgjly8n3ln2djzmhshc04i6g9z07j";
  };
  propagatedBuildInputs = [
    itsdangerous
    jinja2
    misaka
    html5lib
    werkzeug
    bleach
    flask-caching
  ];
  buildInputs = [
    cffi
  ];
  checkInputs = [ nose ];
  checkPhase = ''
    ${python.interpreter} setup.py nosetests
  '';
};
I want to run Isso through Gunicorn. To this end, I build a Python environment combining Isso and Gunicorn. Then, I can invoke the latter with "${issoEnv}/bin/gunicorn", like with a virtual environment.
issoEnv = pkgs.python3.buildEnv.override {
    extraLibs = [
      issoPackage
      pkgs.python3Packages.gunicorn
      pkgs.python3Packages.gevent
    ];
};
Before building a Docker image, I also need to write the configuration file for Isso:
issoConfig = pkgs.writeText "isso.conf" ''
  [general]
  dbpath = /db/comments.db
  host =
    https://vincent.bernat.ch
    http://localhost:8080
  notify = smtp
  [ ]
'';
NixOS comes with a convenient tool to build a Docker image without a Dockerfile:
issoDockerImage = pkgs.dockerTools.buildImage {
  name = "isso";
  tag = "latest";
  extraCommands = ''
    mkdir -p db
  '';
  config = {
    Cmd = [ "${issoEnv}/bin/gunicorn"
            "--name" "isso"
            "--bind" "0.0.0.0:${port}"
            "--worker-class" "gevent"
            "--workers" "2"
            "--worker-tmp-dir" "/dev/shm"
            "--preload"
            "isso.run"
          ];
    Env = [
      "ISSO_SETTINGS=${issoConfig}"
      "SSL_CERT_FILE=${pkgs.cacert}/etc/ssl/certs/ca-bundle.crt"
    ];
  };
};
Because we refer to the issoEnv derivation in config.Cmd, the whole derivation, including Isso and Gunicorn, is copied inside the Docker image. The same applies to issoConfig, the configuration file we created earlier, and pkgs.cacert, the derivation containing trusted root certificates. The resulting image is 171 MB once installed, which is comparable to the Debian Buster image generated by the official Dockerfile. NixOS features an abstraction to run Docker containers. It is not currently documented in the NixOS manual but you can look at the source code of the module for the available options. I chose to use Podman instead of Docker as the backend because it does not require running an additional daemon.
virtualisation.oci-containers = {
  backend = "podman";
  containers = {
    isso = {
      image = "isso";
      imageFile = issoDockerImage;
      ports = ["127.0.0.1:${port}:${port}"];
      volumes = [
        "/var/db/isso:/db"
      ];
    };
  };
};
A systemd unit file is automatically created to run and supervise the container:
$ systemctl status podman-isso.service
  podman-isso.service
     Loaded: loaded (/nix/store/a66gzqqwcdzbh99sz8zz5l5xl8r8ag7w-unit->
     Active: active (running) since Sun 2020-11-01 16:04:16 UTC; 4min 44s ago
    Process: 14564 ExecStartPre=/nix/store/95zfn4vg4867gzxz1gw7nxayqcl>
   Main PID: 14697 (podman)
         IP: 0B in, 0B out
      Tasks: 10 (limit: 2313)
     Memory: 221.3M
        CPU: 10.058s
     CGroup: /system.slice/podman-isso.service
              14697 /nix/store/pn52xgn1wb2vr2kirq3xj8ij0rys35mf-podma>
              14802 /nix/store/7vsba54k6ag4cfsfp95rvjzqf6rab865-conmo>
nov. 01 16:04:17 web03 podman[14697]: container init (image=localhost/isso:latest)
nov. 01 16:04:17 web03 podman[14697]: container start (image=localhost/isso:latest)
nov. 01 16:04:17 web03 podman[14697]: container attach (image=localhost/isso:latest)
nov. 01 16:04:19 web03 conmon[14802]: INFO: connected to SMTP server
nov. 01 16:04:19 web03 conmon[14802]: INFO: connected to https://vincent.bernat.ch
nov. 01 16:04:19 web03 conmon[14802]: [INFO] Starting gunicorn 20.0.4
nov. 01 16:04:19 web03 conmon[14802]: [INFO] Listening at: http://0.0.0.0:8080 (1)
nov. 01 16:04:19 web03 conmon[14802]: [INFO] Using worker: gevent
nov. 01 16:04:19 web03 conmon[14802]: [INFO] Booting worker with pid: 8
nov. 01 16:04:19 web03 conmon[14802]: [INFO] Booting worker with pid: 9
As the last step, we configure Nginx to forward requests for comments.luffy.cx to the container. NixOS provides a simple integration to grab a Let's Encrypt certificate.
services.nginx.virtualHosts."comments.luffy.cx" = {
  root = "/data/webserver/comments.luffy.cx";
  enableACME = true;
  forceSSL = true;
  extraConfig = ''
    access_log /var/log/nginx/comments.luffy.cx.log anonymous;
  '';
  locations."/" = {
    proxyPass = "http://127.0.0.1:${port}";
    extraConfig = ''
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $host;
      proxy_set_header X-Forwarded-Proto $scheme;
      proxy_hide_header Set-Cookie;
      proxy_hide_header X-Set-Cookie;
      proxy_ignore_headers Set-Cookie;
    '';
  };
};
security.acme.certs."comments.luffy.cx" = {
  email = lib.concatStringsSep "@" [ "letsencrypt" "vincent.bernat.ch" ];
};

While I still struggle with Nix and NixOS, I am convinced this is how declarative infrastructure should be done. I like how in one single file, I can define the derivation to build Isso, the configuration, the Docker image, the container definition, and the Nginx configuration. The Nix language is used both for building packages and for managing configurations. Moreover, the Docker image is updated automatically like a regular NixOS host. This solves an issue plaguing the Docker ecosystem: no more stale images! My next step would be to combine this approach with Nomad, a simple orchestrator to deploy and manage containers.

  1. There is a subtle difference: I am using buildPythonPackage instead of buildPythonApplication. This is important for the next step. I didn't investigate whether an application can easily be converted to a package.

29 September 2020

Vincent Bernat: Speeding up bgpq4 with IRRd in a container

When building route filters with bgpq4 or bgpq3, the speed of rr.ntt.net or whois.radb.net can be a bottleneck. Updating many filters may take several tens of minutes, depending on the load:
$ time bgpq4 -h whois.radb.net AS-HURRICANE | wc -l
909869
1.96s user 0.15s system 2% cpu 1:17.64 total
$ time bgpq4 -h rr.ntt.net AS-HURRICANE | wc -l
927865
1.86s user 0.08s system 12% cpu 14.098 total
A possible solution is to run your own IRRd instance in your network, mirroring the main routing registries. A close alternative is to bundle IRRd with all the data in a ready-to-use Docker image. This also has the advantage of easy integration into a Docker-based CI/CD pipeline.
$ git clone https://github.com/vincentbernat/irrd-legacy.git -b blade/master
$ cd irrd-legacy
$ docker build . -t irrd-snapshot:latest
[ ]
Successfully built 58c3e83a1d18
Successfully tagged irrd-snapshot:latest
$ docker container run --rm --detach --publish=43:43 irrd-snapshot
4879cfe7413075a0c217089dcac91ed356424c6b88808d8fcb01dc00eafcc8c7
$ time bgpq4 -h localhost AS-HURRICANE | wc -l
904137
1.72s user 0.11s system 96% cpu 1.881 total
The Dockerfile contains three stages:
  1. building IRRd,1
  2. retrieving various IRR databases, and
  3. assembling the final container with the result of the two previous stages.
The second stage fetches the databases used by rr.ntt.net: NTTCOM, RADB, RIPE, ALTDB, BELL, LEVEL3, RGNET, APNIC, JPIRR, ARIN, BBOI, TC, AFRINIC, ARIN-WHOIS, and REGISTROBR. However, it misses RPKI.2 Feel free to adapt! The image can be scheduled to be rebuilt daily or weekly, depending on your needs. The repository includes a .gitlab-ci.yaml file automating the build and triggering the compilation of all filters by your CI/CD upon success.

  1. Instead of using the latest version of IRRd, the image relies on an older version that does not require a PostgreSQL instance and uses flat files instead.
  2. Unlike the others, the RPKI database is built from the published RPKI ROAs. They can be retrieved with rpki-client and transformed into RPSL objects to be imported into IRRd; a sketch of such a transformation is shown below.
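For illustration, a minimal sketch of such a transformation could look like the following. It assumes rpki-client's JSON export exposes a roas list with asn, prefix, and maxLength entries (the exact layout depends on your rpki-client version) and emits simple RPSL-like route/route6 objects; the rpki.json path is hypothetical and the attributes should be adapted to whatever your IRRd import expects.
import ipaddress
import json

def roas_to_rpsl(path):
    """Yield RPSL-like route/route6 objects from rpki-client's JSON export (sketch)."""
    # Assumed layout: {"roas": [{"asn": ..., "prefix": ..., "maxLength": ...}, ...]}
    with open(path) as f:
        roas = json.load(f)["roas"]
    for roa in roas:
        prefix = ipaddress.ip_network(roa["prefix"])
        attribute = "route6" if prefix.version == 6 else "route"
        # The asn field may be "AS64476" or a bare number depending on the version.
        asn = str(roa["asn"])
        if not asn.startswith("AS"):
            asn = f"AS{asn}"
        yield "\n".join([
            f"{attribute + ':':16}{prefix}",
            f"{'origin:':16}{asn}",
            f"{'descr:':16}RPKI ROA, maxLength {roa.get('maxLength', prefix.prefixlen)}",
            f"{'source:':16}RPKI",
        ])

# Example: print("\n\n".join(roas_to_rpsl("rpki.json")))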

28 September 2020

Vincent Bernat: Syncing RIPE, ARIN and APNIC objects with a custom Ansible module

The Internet is split into five regional Internet registries (RIRs): AFRINIC, ARIN, APNIC, LACNIC, and RIPE. Each RIR maintains an Internet Routing Registry. An IRR allows one to publish information about the routing of Internet number resources.1 Operators use this to determine the owner of an IP address and to construct and maintain routing filters. To ensure your routes are widely accepted, it is important to keep the prefixes you announce up-to-date in an IRR. There are two common tools to query this database: whois and bgpq4. The first one allows you to do a query with the WHOIS protocol:
$ whois -BrG 2a0a:e805:400::/40
[ ]
inet6num:       2a0a:e805:400::/40
netname:        FR-BLADE-CUSTOMERS-DE
country:        DE
geoloc:         50.1109 8.6821
admin-c:        BN2763-RIPE
tech-c:         BN2763-RIPE
status:         ASSIGNED
mnt-by:         fr-blade-1-mnt
remarks:        synced with cmdb
created:        2020-05-19T08:04:58Z
last-modified:  2020-05-19T08:04:58Z
source:         RIPE
route6:         2a0a:e805:400::/40
descr:          Blade IPv6 - AMS1
origin:         AS64476
mnt-by:         fr-blade-1-mnt
remarks:        synced with cmdb
created:        2019-10-01T08:19:34Z
last-modified:  2020-05-19T08:05:00Z
source:         RIPE
The second one allows you to build route filters using the information contained in the IRR database:
$ bgpq4 -6 -S RIPE -b AS64476
NN = [
    2a0a:e805::/40,
    2a0a:e805:100::/40,
    2a0a:e805:300::/40,
    2a0a:e805:400::/40,
    2a0a:e805:500::/40
];
There is no module available on Ansible Galaxy to manage these objects. Each IRR has different ways of being updated. Some RIRs propose an API but some don't. If we restrict ourselves to RIPE, ARIN, and APNIC, the only common method to update objects is email updates, authenticated with a password or a GPG signature.2 Let's write a custom Ansible module for this purpose!

Notice: I recommend that you read Writing a custom Ansible module as an introduction, as well as Syncing MySQL tables for a more instructive example.

Code

The module takes a list of RPSL objects to synchronize and returns the body of an email update if a change is needed:
- name: prepare RIPE objects
  irr_sync:
    irr: RIPE
    mntner: fr-blade-1-mnt
    source: whois-ripe.txt
  register: irr

Prerequisites

The source file should be a set of objects to sync using the RPSL language. This would be the same content you would send manually by email. All objects should be managed by the same maintainer, which is also provided as a parameter. Signing and sending the result is not the responsibility of this module. You need two additional tasks for this purpose:
- name: sign RIPE objects
  shell:
    cmd: gpg --batch --user noc@example.com --clearsign
    stdin: "  irr.objects  "
  register: signed
  check_mode: false
  changed_when: false
- name: update RIPE objects by email
  mail:
    subject: "NEW: update for RIPE"
    from: noc@example.com
    to: "auto-dbm@ripe.net"
    cc: noc@example.com
    host: smtp.example.com
    port: 25
    charset: us-ascii
    body: "  signed.stdout  "
You also need to authorize the PGP keys used to sign the updates by creating a key-cert object and adding it as a valid authentication method for the corresponding mntner object:
key-cert:  PGPKEY-A791AAAB
certif:    -----BEGIN PGP PUBLIC KEY BLOCK-----
certif:    
certif:    mQGNBF8TLY8BDADEwP3a6/vRhEERBIaPUAFnr23zKCNt5YhWRZyt50mKq1RmQBBY
[ ]
certif:    -----END PGP PUBLIC KEY BLOCK-----
mnt-by:    fr-blade-1-mnt
source:    RIPE
mntner:    fr-blade-1-mnt
[ ]
auth:      PGPKEY-A791AAAB
mnt-by:    fr-blade-1-mnt
source:    RIPE

Module definition

Starting from the skeleton described in the previous article, we define the module:
module_args = dict(
    irr=dict(type='str', required=True),
    mntner=dict(type='str', required=True),
    source=dict(type='path', required=True),
)
result = dict(
    changed=False,
)
module = AnsibleModule(
    argument_spec=module_args,
    supports_check_mode=True
)

Getting existing objects

To grab the existing objects, we use the whois command to retrieve all the objects maintained by the provided maintainer.
# Per-IRR variations:
# - whois server
whois = {
    'ARIN': 'rr.arin.net',
    'RIPE': 'whois.ripe.net',
    'APNIC': 'whois.apnic.net'
}
# - whois options
options = {
    'ARIN': ['-r'],
    'RIPE': ['-BrG'],
    'APNIC': ['-BrG']
}
# - objects excluded from synchronization
excluded = ["domain"]
if irr == "ARIN":
    # ARIN does not return these objects
    excluded.extend([
        "key-cert",
        "mntner",
    ])
# Grab existing objects
args = ["-h", whois[irr],
        "-s", irr,
        *options[irr],
        "-i", "mnt-by",
        module.params['mntner']]
proc = subprocess.run(["whois", *args], capture_output=True)
if proc.returncode != 0:
    raise AnsibleError(
        f"unable to query whois: {args}")
output = proc.stdout.decode('ascii')
got = extract(output, excluded)
The first part of the code sets up some IRR-specific constants: the server to query, the options to provide to the whois command, and the objects to exclude from synchronization. The second part invokes the whois command, requesting all objects whose mnt-by field is the provided maintainer. Here is an example of output:
$ whois -h whois.ripe.net -s RIPE -BrG -i mnt-by fr-blade-1-mnt
[ ]
inet6num:       2a0a:e805:300::/40
netname:        FR-BLADE-CUSTOMERS-FR
country:        FR
geoloc:         48.8566 2.3522
admin-c:        BN2763-RIPE
tech-c:         BN2763-RIPE
status:         ASSIGNED
mnt-by:         fr-blade-1-mnt
remarks:        synced with cmdb
created:        2020-05-19T08:04:59Z
last-modified:  2020-05-19T08:04:59Z
source:         RIPE
[ ]
route6:         2a0a:e805:300::/40
descr:          Blade IPv6 - PA1
origin:         AS64476
mnt-by:         fr-blade-1-mnt
remarks:        synced with cmdb
created:        2019-10-01T08:19:34Z
last-modified:  2020-05-19T08:05:00Z
source:         RIPE
[ ]
The result is passed to the extract() function. It parses and normalizes the results into a dictionary mapping object names to objects. We store the result in the got variable.
def extract(raw, excluded):
    """Extract objects."""
    # First step, remove comments and unwanted lines
    objects = "\n".join([obj
                         for obj in raw.split("\n")
                         if not obj.startswith((
                                 "#",
                                 "%",
                         ))])
    # Second step, split objects
    objects = [RPSLObject(obj.strip())
               for obj in re.split(r"\n\n+", objects)
               if obj.strip()
               and not obj.startswith(
                   tuple(f"{x}:" for x in excluded))]
    # Last step, put objects in a dict
    objects = {repr(obj): obj
               for obj in objects}
    return objects
RPSLObject() is a class enabling normalization and comparison of objects. Look at the module code for more details.
>>> output="""
... inet6num:       2a0a:e805:300::/40
... [ ]
... """
>>> pprint({k: str(v) for k, v in extract(output, excluded=[]).items()})
{'<Object:inet6num:2a0a:e805:300::/40>':
   'inet6num:       2a0a:e805:300::/40\n'
   'netname:        FR-BLADE-CUSTOMERS-FR\n'
   'country:        FR\n'
   'geoloc:         48.8566 2.3522\n'
   'admin-c:        BN2763-RIPE\n'
   'tech-c:         BN2763-RIPE\n'
   'status:         ASSIGNED\n'
   'mnt-by:         fr-blade-1-mnt\n'
   'remarks:        synced with cmdb\n'
   'source:         RIPE',
 '<Object:route6:2a0a:e805:300::/40>':
   'route6:         2a0a:e805:300::/40\n'
   'descr:          Blade IPv6 - PA1\n'
   'origin:         AS64476\n'
   'mnt-by:         fr-blade-1-mnt\n'
   'remarks:        synced with cmdb\n'
   'source:         RIPE'}
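As a rough idea of what it does, here is a minimal sketch (an approximation, not the module's actual code): it drops volatile attributes such as created: and last-modified:, so that equality only depends on the content we manage, and its repr() is used as the dictionary key.
class RPSLObject:
    """Minimal sketch of an RPSL object wrapper (approximation of the module's class)."""

    # Attributes that change on every update and should not trigger a diff.
    IGNORED = ("created:", "last-modified:")

    def __init__(self, raw):
        self.raw = raw
        self.lines = [line for line in raw.split("\n")
                      if not line.startswith(self.IGNORED)]

    def __repr__(self):
        # The first attribute identifies the object, e.g. "inet6num: 2a0a:e805:300::/40".
        kind, _, value = self.lines[0].partition(":")
        return f"<Object:{kind}:{value.strip()}>"

    def __str__(self):
        return "\n".join(self.lines)

    def __eq__(self, other):
        return isinstance(other, RPSLObject) and self.lines == other.lines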

Comparing with wanted objects

Let's build the wanted dictionary with the same structure, reusing the extract() function verbatim:
with open(module.params['source']) as f:
    source = f.read()
wanted = extract(source, excluded)
The next step is to compare got and wanted to build the diff object:
if got != wanted:
    result['changed'] = True
    if module._diff:
        result['diff'] = [
            dict(before_header=k,
                 after_header=k,
                 before=str(got.get(k, "")),
                 after=str(wanted.get(k, "")))
            for k in set((*wanted.keys(), *got.keys()))
            if k not in wanted or k not in got or wanted[k] != got[k]]
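As a toy illustration (values invented for the example), the comprehension keeps the keys that are new, removed, or modified, and skips identical objects:
>>> got = {"a": "same", "b": "old"}
>>> wanted = {"a": "same", "b": "new", "c": "added"}
>>> sorted(k for k in set((*wanted.keys(), *got.keys()))
...        if k not in wanted or k not in got or wanted[k] != got[k])
['b', 'c']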

Returning updates

The module itself has no side effects. If there is a difference, we return the updates to send by email. We choose to include all the wanted objects in the updates (contained in the source variable) and let the IRR ignore unmodified objects. We also append the objects to be deleted by adding a delete: attribute to each of them.
# We send all source objects and deleted objects.
deleted_mark = f" 'delete:':16 deleted by CMDB"
deleted = "\n\n".join([f" got[k].raw \n deleted_mark "
                       for k in got
                       if k not in wanted])
result['objects'] = f" source \n\n deleted "
module.exit_json(**result)
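The :16 format specification in deleted_mark is only there to pad the delete: attribute to the same 16-character column used by the other RPSL attributes; a quick check in a Python shell:
>>> f"{'delete:':16}deleted by CMDB"
'delete:         deleted by CMDB'
>>> f"{'mnt-by:':16}fr-blade-1-mnt"
'mnt-by:         fr-blade-1-mnt'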

The complete code is available on GitHub. The module supports both --diff and --check flags. It does not return anything if no change is detected. It works with APNIC, RIPE, and ARIN. It is not perfect: it may not detect some changes,3 it is not able to modify objects not owned by the provided maintainer,4 and some attributes cannot be modified, requiring you to manually delete and recreate the updated object.5 However, this module should automate 95% of your needs.

  1. Other IRRs exist without being attached to a RIR. The most notable one is RADb.
  2. ARIN is phasing out this method in favor of IRR-online. RIPE has an API available, but email updates are still supported and not planned to be deprecated. APNIC plans to expose an API.
  3. For ARIN, we cannot query key-cert and mntner objects and therefore we cannot detect changes in them. It is also not possible to detect changes to the auth mechanisms of a mntner object.
  4. APNIC does not assign top-level objects to the maintainer associated with the owner.
  5. Changing the status of an inetnum object requires deleting and recreating the object.

19 September 2020

Vincent Bernat: Keepalived and unicast over multiple interfaces

Keepalived is a Linux implementation of VRRP. The usual role of VRRP is to share a virtual IP across a set of routers. For each VRRP instance, a leader is elected and gets to serve the IP address, ensuring the high availability of the attached service. Keepalived can also be used for a generic leader election, thanks to its ability to use scripts for healthchecking and run commands on state change. A simple configuration looks like this:
vrrp_instance gateway1 {
  state BACKUP          # ❶
  interface eth0        # ❷
  virtual_router_id 12  # ❸
  priority 101          # ❹
  virtual_ipaddress {
    2001:db8:ff/64
  }
}
The state keyword in ❶ instructs Keepalived to not take the leader role when starting. Otherwise, incoming nodes create a temporary disruption by taking over the IP address until the election settles. The interface keyword in ❷ defines the interface for sending and receiving VRRP packets. It is also the default interface to configure the virtual IP address. The virtual_router_id directive in ❸ is common to all nodes sharing the virtual IP. The priority keyword in ❹ helps choose which router will be elected as leader. If you need more information about Keepalived, be sure to check the documentation. VRRP design is tied to Ethernet networks and requires a multicast-enabled network for communication between nodes. In some environments, notably public clouds, multicast is unavailable. In this case, Keepalived can send VRRP packets using unicast:
vrrp_instance gateway1 {
  state BACKUP
  interface eth0
  virtual_router_id 12
  priority 101
  unicast_peer {
    2001:db8::11
    2001:db8::12
  }
  virtual_ipaddress {
    2001:db8:ff/64 dev lo
  }
}
Another process, like a BGP daemon, should advertise the virtual IP address to the network. If needed, Keepalived can trigger whatever action is required for this by using notify_* scripts. Until version 2.21 (not released yet), the interface directive is mandatory and Keepalived will transmit and receive VRRP packets on this interface only. If peers are reachable through several interfaces, like on a BGP on the host setup, you need a workaround. A simple one is to use a VXLAN interface:
$ ip -6 link add keepalived6 type vxlan id 6 dstport 4789 local 2001:db8::10 nolearning
$ bridge fdb append 00:00:00:00:00:00 dev keepalived6 dst 2001:db8::11
$ bridge fdb append 00:00:00:00:00:00 dev keepalived6 dst 2001:db8::12
$ ip link set up dev keepalived6
Learning of MAC addresses is disabled and one generic entry for each peer is added to the forwarding database: transmitted packets are broadcast to all peers, notably VRRP packets. Have a look at VXLAN & Linux for additional details.
vrrp_instance gateway1 {
  state BACKUP
  interface keepalived6
  mcast_src_ip 2001:db8::10
  virtual_router_id 12
  priority 101
  virtual_ipaddress {
    2001:db8:ff/64 dev lo
  }
}
Starting from Keepalived 2.21, unicast_peer can be used without the interface directive. I think using VXLAN is still a neat trick applicable to other situations where communication using broadcast or multicast is needed while the underlying network provides no support for it.
